Lecture Notes in 
Computer Science 1842 



David Vernon (Ed.) 



Computer Vision - 
ECCV 2000 



6th European Conference 
on Computer Vision 
Dublin, Ireland, June/July 2000 
Proceedings, Part I 








springer 




Lecture Notes in Computer Science 1842 

Edited by G. Goos, J. Hartmanis and J. van Leeuwen 




Springer 

Berlin 

Heidelberg 

New York 

Barcelona 

Hong Kong 

London 

Milan 

Paris 

Singapore 

Tokyo 




David Vernon (Ed.) 



Computer Vision - 
ECCV 2000 



6th European Conference on Computer Vision 
Dublin, Ireland, June 26 - July 1, 2000 
Proceedings, Part I 




Springer 




Series Editors 



Gerhard Goos, Karlsruhe University, Germany 
Juris Hartmanis, Cornell University, NY, USA 
Jan van Leeuwen, Utrecht University, The Netherlands 



Volume Editor 
David Vernon 

5 Edwin Court, Glenageary, Co. Dublin, Ireland 
E-mail: vernon@ieee.org 



Cataloging-in-Publication data applied for 
Die Deutsche Bibliothek - CIP-Einheitsaufnahme 

Computer vision : proceedings / ECCV 2000, 6th European Conference on 
Computer Vision Dublin, Ireland, June 26 - July 1, 2000. David 
Vernon (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; 
London ; Milan ; Paris ; Singapore ; Tokyo : Springer 
Pt. 1 . - (2000) 

(Lecture notes in computer science ; Vol. 1842) 

ISBN 3-540-67685-6 



CR Subject Classification (1998): I.4, I.3.5, I.5, I.2.9-10 
ISSN 0302-9743 

ISBN 3-540-67685-6 Springer-Verlag Berlin Heidelberg New York 



This work is subject to copyright. All rights are reserved, whether the whole or part of the material is 
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, 
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication 
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, 
in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are 
liable for prosecution under the German Copyright Law. 

Springer-Verlag is a company in the BertelsmannSpringer publishing group. 

© Springer-Verlag Berlin Heidelberg 2000 
Printed in Germany 

Typesetting: Camera-ready by author 

Printed on acid-free paper SPIN 10718972 06/3142 5 4 3 2 1 0 




Preface 



Ten years ago, the inaugural European Conference on Computer Vision was 
held in Antibes, France. Since then, ECCV has been held biennially under the 
auspices of the European Vision Society at venues around Europe. This year, 
the privilege of organizing ECCV 2000 falls to Ireland and it is a signal honour 
for us to host what has become one of the most important events in the calendar 
of the computer vision community. 

ECCV is a single-track conference comprising the highest quality, previously 
unpublished, contributed papers on new and original research in computer vision. 
This year, 266 papers were submitted and, following a rigorous double-blind 
review process, with each paper being reviewed by three referees, 116 papers 
were selected by the Programme Committee for presentation at the conference. 

The venue for ECCV 2000 is the University of Dublin, Trinity College. Fo- 
unded in 1592, it is Ireland’s oldest university and has a proud tradition of 
scholarship in the Arts, Humanities, and Sciences, alike. The Trinity campus, 
set in the heart of Dublin, is an oasis of tranquility and its beautiful squares, 
elegant buildings, and tree-lined playing-fields provide the perfect setting for any 
conference. 

The organization of ECCV 2000 would not have been possible without the 
support of many people. In particular, I wish to thank the Department of Com- 
puter Science, Trinity College, and its head, J. G. Byrne, for hosting the con- 
ference secretariat. Gerry Lacey, Damian Gordon, Niall Winters, Mary Murray, 
and Dermot Furlong provided unstinting help and assistance whenever it was 
needed. Sarah Campbell and Tony Dempsey in Trinity’s Accommodation Office 
were a continuous source of guidance and advice. I am also indebted to Michael 
Nowlan and his staff in Trinity’s Information Systems Services for hosting the 
ECCV 2000 web-site. I am grateful too to the staff of Springer-Verlag for always 
being available to assist with the production of these proceedings. There are 
many others whose help - and forbearance - I would like to acknowledge: my 
thanks to all. 

Support came in other forms too, and it is a pleasure to record here the kind 
generosity of the University of Freiburg, MV Technology Ltd., and Captec Ltd., 
who sponsored prizes for best paper awards. 

Finally, a word about conferences. The technical excellence of the scientific 
programme is undoubtedly the most important facet of ECCV. But there are 
other facets to an enjoyable and productive conference, facets which should en- 
gender conviviality, discourse, and interaction; my one wish is that all delegates 
will leave Ireland with great memories, many new friends, and inspirational ideas 
for future research. 

Abstract. In recent years several techniques have been proposed for modelling the 
low-dimensional manifolds, or ‘subspaces’, of natural images. Examples include 
principal component analysis (as used for instance in ‘eigen-faces’), independent 
component analysis, and auto-encoder neural networks. Such methods suffer from 
a number of restrictions such as the limitation to linear manifolds or the absence of 
a probabilistic representation. In this paper we exploit recent developments in the 
fields of variational inference and latent variable models to develop a novel and 
tractable probabilistic approach to modelling manifolds which can handle complex 
non-linearities. Our framework comprises a mixture of sub-space components in 
which both the number of components and the effective dimensionality of the sub- 
spaces are determined automatically as part of the Bayesian inference procedure. 
We illustrate our approach using two classical problems: modelling the manifold 
of face images and modelling the manifolds of hand-written digits. 



1 Introduction 

Interest in image subspace modelling has grown considerably in recent years in contexts 
such as recognition, detection, verification and coding. Although an individual image 
can be considered as a point in a high-dimensional space described by the pixel values, 
an ensemble of related images, for example faces, lives on a (noisy) non-linear manifold 
having a much lower intrinsic dimensionality. One of the simplest approaches to 
modelling such manifolds involves finding the principal components of the ensemble of 
images, as used for example in ‘eigen-faces’. 

However, simple principal component analysis suffers from two key limitations. 
First, it does not directly define a probability distribution, and so it is difficult to use 
standard PCA as a natural component in a probabilistic solution to a computer vision 
problem. Second, the manifold defined by PCA is necessarily linear. Techniques which 
address the first of these problems by constructing a density model include Gaussians 
and mixtures of Gaussians. The second problem has been addressed by considering 
non-linear projective methods such as principal curves and auto-encoder neural networks. 
Bregler and Omohundro and Heap and Hogg [5] use mixture representations 
to try to capture the non-linearity of the manifold. However, their model fitting is based 
on simple clustering algorithms (related to K-means) and lacks the fully probabilistic 
approach discussed in this paper. 

A central problem in density modelling in high dimensional spaces concerns model 
complexity. Models fitted using maximum likelihood are particularly prone to severe 
over-fitting unless the number of free parameters is restricted to be much less than the 
number of data points. For example, it is clearly not feasible to fit an unconstrained 
mixture of Gaussians directly to the data in the original high-dimensional space using 
maximum likelihood due to the excessive number of parameters in the covariance 
matrices. Moghaddam and Pentland therefore project the data onto a PCA sub-space 
and then perform density estimation within this lower dimensional space using Gaussian 
mixtures. While this limits the number of free parameters in the model, the non-linearity 
of the manifold requires the PCA space to have a significantly higher dimensionality 
than that of the manifold itself, and so again the model is prone to over-parameterization. 

One important aspect of model complexity concerns the dimensionality of the 
manifold itself, which is typically not known in advance. Moghaddam, for example, 
arbitrarily fixes the model dimensionality to be 20. 

In this paper we present a sophisticated Bayesian framework for modelling the ma- 
nifolds of images. Our approach constructs a probabilistically consistent density model 
which can capture essentially arbitrary non-linearities and which can also discover an 
appropriate dimensionality for modelling the manifold. A key feature is the use of a 
fully Bayesian formulation in which the appropriate model complexity, and indeed the 
dimensionality of the manifold itself, can be discovered automatically as part of the 
inference procedure. The model is based on a mixture of components each of which 
is a latent variable model whose dimensionality can be inferred from the data. It avoids 
a discrete model search over dimensionality, involving instead the use of continuous 
hyper-parameters to determine an effective dimensionality for the components in the 
mixture model. 

Our approach builds on recent developments in latent variable models and variational 
inference. In Section 2 we describe the probabilistic model, and in Section 3 we explain 
the variational framework used to fit it to the data. Results from face data and from images 
of hand-written digits are presented in Section 4 and conclusions given in Section 5. 

Note that several authors have explored the use of non-linear warping of the image, 
for example in the context of face recognition, in order to take account of changes of 
pose or of interpersonal variation. In so far as such distortions can be accurately 
represented, these transformations should be of significant benefit in tackling the 
subspace modelling problem, albeit at increased computational expense. It should be em- 
phasised that such approaches can be used to augment virtually any sub-space modelling 
algorithm, including those discussed in this paper, and so they will not be considered 
further. 

2 Models for Manifolds 

Our approach to modelling the manifolds of images builds upon recent developments 
in latent variable models and can be seen as a natural development of PCA and mixture 
modelling frameworks leading to a highly flexible, fully probabilistic framework. We 
begin by showing how conventional PCA can be reformulated probabilistically and 
hence used as the component distribution in a mixture model. Then we show how a 
Bayesian approach allows the model complexity (including the number of components 
in the mixture as well as the effective dimensionality of the manifold) to be inferred 
from the data. 



2.1 Maximum Likelihood PCA 

Principal component analysis (PCA) is a widely used technique for data analysis. It can 
be defined as the linear projection of a data set into a lower-dimensional space under 
which the retained variance is a maximum, or equivalently under which the sum-of- 
squares reconstruction cost is minimized. 

Consider a data set $D$ of observed $d$-dimensional vectors $D = \{t_n\}$ where 
$n \in \{1, \ldots, N\}$. Conventional PCA is obtained by first computing the sample 
covariance matrix given by 

\[ S = \frac{1}{N} \sum_{n=1}^{N} (t_n - \bar{t})(t_n - \bar{t})^T \tag{1} \]

where $\bar{t} = N^{-1} \sum_n t_n$ is the sample mean. Next the eigenvectors $u_i$ and eigenvalues 
$\lambda_i$ of $S$ are found, where $S u_i = \lambda_i u_i$ and $i = 1, \ldots, d$. The eigenvectors corresponding 
to the $q$ largest eigenvalues (where $q < d$) are retained, and a reduced-dimensionality 
representation of the data set is defined by $x_n = U_q^T (t_n - \bar{t})$ where $U_q = (u_1, \ldots, u_q)$. 

A significant limitation of conventional PCA is that it does not define a probability 
distribution. Recently, however, Tipping and Bishop showed how PCA can be 
reformulated as the maximum likelihood solution of a specific latent variable model, as 
follows. We first introduce a $q$-dimensional latent variable $x$ whose prior distribution 
is a zero mean Gaussian $P(x) = \mathcal{N}(0, I_q)$, where $I_q$ is the $q$-dimensional unit matrix. 
The observed variable $t$ is then defined as a linear transformation of $x$ with additive 
Gaussian noise $t = W x + \mu + \epsilon$ where $W$ is a $d \times q$ matrix, $\mu$ is a $d$-dimensional 
vector and $\epsilon$ is a zero-mean Gaussian-distributed vector with covariance $\tau^{-1} I_d$ (where $\tau$ 
is an inverse variance, often called the ‘precision’). Thus $P(t|x) = \mathcal{N}(W x + \mu, \tau^{-1} I_d)$. 
The marginal distribution of the observed variable is then given by the convolution of 
two Gaussians and is itself Gaussian 

\[ P(t) = \int P(t|x) P(x) \, dx = \mathcal{N}(\mu, C) \tag{2} \]

where the covariance matrix $C = W W^T + \tau^{-1} I_d$. The model (2) represents a 
constrained Gaussian distribution governed by the parameters $\mu$, $W$ and $\tau$. 

It was shown by Tipping and Bishop that the stationary points of the log 
likelihood with respect to $W$ satisfy 

\[ W_{ML} = U_q (\Lambda_q - \tau^{-1} I_q)^{1/2} \tag{3} \]

where the columns of $U_q$ are eigenvectors of $S$, with corresponding eigenvalues in the 
diagonal matrix $\Lambda_q$. It was also shown that the maximum of the likelihood is achieved 
when the $q$ largest eigenvalues are chosen, so that the columns of $U_q$ correspond to the 
principal eigenvectors, with all other choices of eigenvalues corresponding to saddle 
points. The maximum likelihood solution for $\tau$ is then given by 

\[ \frac{1}{\tau_{ML}} = \frac{1}{d - q} \sum_{i=q+1}^{d} \lambda_i \tag{4} \]

which has a natural interpretation as the average variance lost per discarded dimension. 
The density model (2) thus represents a probabilistic formulation of PCA. It is easily 
verified that conventional PCA is recovered by treating $\tau$ as a parameter and taking the 
limit $\tau \to \infty$. 
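
The closed-form solution in (1), (3) and (4) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `ml_ppca`, the synthetic data, and the choice $q = 3$ are hypothetical and not taken from the paper, and the arbitrary rotation of the latent space is fixed to the identity.

```python
import numpy as np

def ml_ppca(T, q):
    """Maximum likelihood probabilistic PCA via eqs. (1), (3) and (4)."""
    N, d = T.shape
    t_bar = T.mean(axis=0)
    centred = T - t_bar
    S = centred.T @ centred / N                 # sample covariance, eq. (1)
    eigvals, eigvecs = np.linalg.eigh(S)        # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # re-order to descending
    lam, U = eigvals[order], eigvecs[:, order]
    inv_tau = lam[q:].mean()                    # eq. (4): mean discarded eigenvalue
    # eq. (3): scale the q principal eigenvectors column by column
    W = U[:, :q] * np.sqrt(np.maximum(lam[:q] - inv_tau, 0.0))
    return t_bar, W, 1.0 / inv_tau

# Hypothetical test data: 3 directions with std 1.0, 7 directions with std 0.5.
rng = np.random.default_rng(0)
stds = np.array([1.0] * 3 + [0.5] * 7)
T = rng.normal(size=(5000, 10)) * stds
mu, W, tau = ml_ppca(T, q=3)
C = W @ W.T + np.eye(10) / tau                  # model covariance from eq. (2)
```

With this construction the leading $q$ eigenvalues of `C` reproduce those of `S`, while the remaining directions share the single noise variance `1 / tau`.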

Probabilistic PCA has been successfully applied to problems in data compression, 
density estimation and data visualization, and has been extended to mixture and 
hierarchical mixture models. As with conventional PCA, however, the model itself 
provides no mechanism for determining the value of the latent-space dimensionality $q$. 
For $q = d - 1$ the model is equivalent to a full-covariance Gaussian distribution*, while 
for $q < d - 1$ it represents a constrained Gaussian in which the variance in the remaining 
$d - q$ directions is modelled by the single parameter $\tau$. Thus the choice of $q$ corresponds 
to a problem in model complexity optimization. In principle, cross-validation to compare 
all possible values of $q$ offers a possible approach. However, maximum likelihood 
estimation is highly biased (leading to ‘overfitting’) and so in practice excessively large data 
sets would be required and the procedure would become computationally intractable. 



2.2 Bayesian PCA 

The issue of model complexity can be handled naturally within a Bayesian paradigm. 
Armed with the probabilistic reformulation of PCA defined in Section 2.1, a Bayesian 
treatment of PCA is obtained by first introducing prior distributions over the 
parameters $\mu$, $W$ and $\tau$. A key goal is to control the effective dimensionality of the latent 
space (corresponding to the number of retained principal components). Furthermore, 
we seek to avoid discrete model selection and hence we introduce continuous 
hyper-parameters to determine automatically an appropriate effective dimensionality for the 
latent space as part of the process of Bayesian inference. This is achieved by 
introducing a hierarchical prior $P(W|\alpha)$ over the matrix $W$, governed by a $q$-dimensional 

* This follows from the fact that the $q - 1$ linearly independent columns of $W$ have independent 
variances along $q - 1$ directions, while the variance along the remaining direction is controlled 
by $\tau$. 



Non-linear Bayesian Image Modelling 






vector of hyper-parameters $\alpha = \{\alpha_1, \ldots, \alpha_q\}$. Each hyper-parameter controls one of 
the columns of the matrix $W$ through a conditional Gaussian distribution of the form 

\[ P(W|\alpha) = \prod_{i=1}^{q} \left( \frac{\alpha_i}{2\pi} \right)^{d/2} \exp\left\{ -\frac{1}{2} \alpha_i \|w_i\|^2 \right\} \tag{5} \]

where $\{w_i\}$ are the columns of $W$. This form of prior is motivated by the framework of 
automatic relevance determination (ARD) introduced in the context of neural networks 
by Neal and MacKay (see MacKay, 1995). Each $\alpha_i$ controls the inverse variance of the 
corresponding $w_i$, so that if a particular $\alpha_i$ has a posterior distribution concentrated at 
large values, the corresponding $w_i$ will tend to be small, and that direction in latent 
space will be effectively ‘switched off’. The dimensionality of the latent space is set to 
its maximum possible value $q = d - 1$. 

We complete the specification of the Bayesian model by defining the remaining 
priors to have the form 



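\[ P(\mu) = \mathcal{N}(\mu | 0, \beta^{-1} I) \tag{6} \]

\[ P(\alpha) = \prod_{i=1}^{q} \Gamma(\alpha_i | a_\alpha, b_\alpha) \tag{7} \]

\[ P(\tau) = \Gamma(\tau | a_\tau, b_\tau). \tag{8} \]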



Here $\mathcal{N}(x | m, \Sigma)$ denotes a multivariate normal distribution over $x$ with mean $m$ and 
covariance matrix $\Sigma$. Similarly, $\Gamma(x | a, b)$ denotes a Gamma distribution over $x$ given 
by 

\[ \Gamma(x | a, b) = \frac{b^a x^{a-1} e^{-bx}}{\Gamma(a)} \tag{9} \]

where $\Gamma(a)$ is the Gamma function. We obtain broad priors by setting $a_\alpha = b_\alpha = a_\tau = 
b_\tau = 10^{-3}$ and $\beta = 10^{-3}$. 

As an illustration of the role of the hyperparameters in determining model complexity, 
we consider a data set consisting of 300 points in 10 dimensions, in which the data is 
drawn from a Gaussian distribution having standard deviation 1.0 in 3 directions and 
standard deviation 0.5 in the remaining 7 directions. The result of fitting both maximum 
likelihood and Bayesian PCA models is shown in Figure 1. (The Bayesian model was 
trained using the variational approach discussed in Section 3.) In this case the Bayesian 
model has an effective dimensionality of $q_{\mathrm{eff}} = 3$ as expected. 



2.3 Mixtures of Bayesian PCA Models 

Given a probabilistic formulation of PCA we can use it to construct a mixture distribution 
comprising a linear superposition of principal component analyzers. If we were to fit 
such a model to data using maximum likelihood we would have to choose both the 
number M of components and the latent space dimensionality q of the components. 
For moderate numbers of components and data spaces of several dimensions it quickly 
becomes computationally costly to use cross-validation. 

Fig. 1. Hinton diagrams of the matrix $W$ for a data set in 10 dimensions having m = 3 directions 
with larger variance than the remaining 7 directions. The area of each square is proportional to the 
magnitude of the corresponding matrix element, and the squares are white for positive values and 
black for negative values. The left plot shows $W_{ML}$ from maximum likelihood PCA while the 
right plot shows the posterior mean $\langle W \rangle$ from the Bayesian approach, showing how the model is 
able to discover the appropriate dimensionality by suppressing the 6 surplus degrees of freedom. 



Here Bayesian PCA offers a significant advantage in allowing the effective dimen- 
sionalities of the models to be determined automatically. Furthermore, we also wish to 
determine the appropriate number of components in the mixture. We do this by Bayesian 
model comparison as an integral part of the learning procedure as discussed in the 
next section. 

To formulate the probabilistic model we introduce, for each data point $t_n$, an 
additional $M$-dimensional binary latent variable $s_n$ which has one non-zero element denoting 
which of the $M$ components in the mixture is responsible for generating $t_n$. These 
discrete latent variables have distributions governed by hyperparameters $\pi = \{\pi_m\}$ where 
$m = 1, \ldots, M$, 

\[ P(s = \delta_m | \pi) = \pi_m \tag{10} \]

where $\delta_m$ denotes a vector with all elements zero except element $m$ whose value is 1. 
The parameters $\pi$ are given a Dirichlet distribution 

\[ P(\pi) = \mathrm{Dir}(\pi | u) = \frac{1}{Z(u)} \prod_{m=1}^{M} \pi_m^{u_m - 1} \, \delta\left( \sum_{m=1}^{M} \pi_m - 1 \right) \tag{11} \]

where $u$ are parameters of the distribution, and $Z(u)$ is the normalization constant. 
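
The generative process defined by (10), (11) and the per-component latent variable model can be sketched as follows. All numerical values, the function name `sample_ppca_mixture`, and the array layout (component parameters stacked along the first axis) are assumptions made for illustration; the priors over $W$, $\mu$ and $\tau$ are replaced here by arbitrary fixed draws.

```python
import numpy as np

def sample_ppca_mixture(N, pi, W, mu, tau, rng):
    """Draw N points from a mixture of probabilistic PCA components.

    W has shape (M, d, q) and mu has shape (M, d); tau is a shared precision.
    """
    M, d, q = W.shape
    labels = rng.choice(M, size=N, p=pi)        # position of the non-zero element of s_n
    S = np.eye(M, dtype=int)[labels]            # one-hot rows, s_n = delta_m
    X = rng.normal(size=(N, q))                 # x_n ~ N(0, I_q)
    noise = rng.normal(size=(N, d)) / np.sqrt(tau)
    # t_n = W_m x_n + mu_m + noise, using the component selected by s_n
    T = np.einsum('ndq,nq->nd', W[labels], X) + mu[labels] + noise
    return T, S

rng = np.random.default_rng(1)
M, d, q = 3, 5, 2                               # hypothetical sizes
u = np.ones(M)                                  # Dirichlet parameters of eq. (11)
pi = rng.dirichlet(u)                           # mixing proportions pi ~ Dir(u)
W = rng.normal(size=(M, d, q))
mu = rng.normal(scale=5.0, size=(M, d))
T, S = sample_ppca_mixture(1000, pi, W, mu, tau=4.0, rng=rng)
```

Each row of `S` is the binary indicator $s_n$ for the corresponding row of `T`.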

In a simple mixture of Bayesian PCA models, each component would be free to 
determine its own dimensionality. A central goal of this work, however, is to model a 
continuous non-linear manifold. We therefore wish the components in the mixture to 
have a common dimensionality whose value is a-priori unknown and which should be 
inferred from the data. This can be achieved within our framework by using a single set 
of $\alpha$ hyper-parameters which are shared by all of the components in the mixture. The 
probabilistic structure of the resulting model is displayed diagrammatically in Figure 2. 

Fig. 2. Representation of a Bayesian PCA mixture as a probabilistic graphical model (directed 
acyclic graph) showing the hierarchical prior over $W$ governed by the vector of shared 
hyper-parameters $\alpha$. The boxes denote ‘plates’ comprising $N$ independent observations of the data vector 
$t_n$ (shown shaded) together with the corresponding hidden variables $x_n$ and $s_n$, with a similar 
plate denoting the $M$ copies of the parameters associated with each component in the mixture. 



3 Variational Inference 



In common with many complex probabilistic models, exact computation cannot be 
performed analytically. We avoid the computational complexity, and difficulty of 
convergence assessment, associated with Markov chain Monte Carlo methods by using 
variational inference [10]. For completeness we first give a brief overview of variational 
methods and then describe the variational solution for the Bayesian Mixture PCA model. 

In order to motivate the variational approach, consider a general probabilistic model 
with parameters $\theta = \{\theta_i\}$ and observed data $D$, for which the marginal probability of 
the data is given by 

\[ P(D) = \int P(D, \theta) \, d\theta. \tag{12} \]

We have already noted that integration with respect to the parameters is analytically 
intractable. Variational methods involve the introduction of a distribution $Q(\theta)$ which, 
as we shall see shortly, provides an approximation to the true posterior distribution. 
Consider the following transformation applied to the log marginal likelihood 

\[ \ln P(D) = \ln \int P(D, \theta) \, d\theta \tag{13} \]

\[ = \ln \int Q(\theta) \frac{P(D, \theta)}{Q(\theta)} \, d\theta \tag{14} \]

\[ \geq \int Q(\theta) \ln \frac{P(D, \theta)}{Q(\theta)} \, d\theta = \mathcal{L}(Q) \tag{15} \]

where we have applied Jensen’s inequality. We see that the function $\mathcal{L}(Q)$ forms a 
rigorous lower bound on the true log marginal likelihood. The significance of this 
transformation is that, through a suitable choice for the $Q$ distribution, the quantity $\mathcal{L}(Q)$ 
may be tractable to compute, even though the original log likelihood function is not. 

From (15) it is easy to see that the difference between the true log marginal likelihood 
$\ln P(D)$ and the bound $\mathcal{L}(Q)$ is given by 

\[ \mathrm{KL}(Q \| P) = -\int Q(\theta) \ln \frac{P(\theta | D)}{Q(\theta)} \, d\theta \tag{16} \]

which is the Kullback-Leibler (KL) divergence between the approximating distribution 
$Q(\theta)$ and the true posterior $P(\theta | D)$. The relationship between the various quantities is 
shown in Figure 3. 




Fig. 3. The quantity $\mathcal{L}(Q)$ provides a rigorous lower bound on the true log marginal likelihood 
$\ln P(D)$, with the difference being given by the Kullback-Leibler divergence $\mathrm{KL}(Q \| P)$ between 
the approximating distribution $Q(\theta)$ and the true posterior $P(\theta | D)$. 
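
The decomposition $\ln P(D) = \mathcal{L}(Q) + \mathrm{KL}(Q \| P)$ can be checked numerically on a toy problem. The sketch below assumes, purely for illustration, that $\theta$ takes one of $K$ discrete values so that the integrals in (12), (15) and (16) become sums; the joint probabilities are random placeholders, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
K = 6
joint = rng.random(K) * 1e-3            # hypothetical values of P(D, theta_k)
log_evidence = np.log(joint.sum())      # ln P(D), discrete analogue of eq. (12)
posterior = joint / joint.sum()         # P(theta_k | D)

def lower_bound(Q):
    """Discrete analogue of L(Q) in eq. (15)."""
    return np.sum(Q * (np.log(joint) - np.log(Q)))

def kl(Q):
    """Discrete analogue of KL(Q || P) in eq. (16)."""
    return -np.sum(Q * (np.log(posterior) - np.log(Q)))

Q = rng.dirichlet(np.ones(K))           # an arbitrary approximating distribution
gap = log_evidence - lower_bound(Q)     # should coincide with kl(Q)
```

Setting `Q` equal to `posterior` closes the gap, in which case `lower_bound(posterior)` equals `log_evidence`.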



Suppose we consider a completely free-form optimization over $Q$, allowing for all 
possible $Q$ distributions. Using the well-known result that the KL divergence between 
two distributions $Q(\theta)$ and $P(\theta)$ is minimized by $Q(\theta) = P(\theta)$ we see that the optimal 
$Q$ distribution is given by the true posterior, in which case the KL divergence is zero, the 
bound becomes exact and $\mathcal{L}(Q) = \ln P(D)$. However, this will not lead to any 
simplification of the problem since, by assumption, direct evaluation of $\ln P(D)$ is intractable. 
In order to make progress it is necessary to restrict the range of $Q$ distributions. 

The goal in a variational approach is to choose a suitable form for $Q(\theta)$ which is 
sufficiently simple that the lower bound $\mathcal{L}(Q)$ can readily be evaluated and yet which 
is sufficiently flexible that the bound is reasonably tight. We generally choose some 
family of $Q$ distributions and then seek the best approximation within this family by 
maximizing the lower bound $\mathcal{L}(Q)$. Since the true log likelihood is independent of $Q$ 
we see that this is equivalent to minimizing the Kullback-Leibler divergence. 

One approach is to consider a parametric family of $Q$ distributions of the form 
$Q(\theta; \psi)$ governed by a set of parameters $\psi$. We can then adapt $\psi$ by minimizing the 
KL divergence to find the best approximation within this family. Here we consider an 
alternative approach which is to restrict the functional form of $Q(\theta)$ by assuming that it 
factorizes over the component variables $\{\theta_i\}$ in $\theta$, so that 

\[ Q(\theta) = \prod_i Q_i(\theta_i). \tag{17} \]

The KL divergence can then be minimized over all possible factorial distributions by 
performing a free-form minimization over each of the $Q_i$, leading to the following result 

\[ Q_i(\theta_i) = \frac{\exp \langle \ln P(D, \theta) \rangle_{k \neq i}}{\int \exp \langle \ln P(D, \theta) \rangle_{k \neq i} \, d\theta_i} \tag{18} \]

where $\langle \cdot \rangle_{k \neq i}$ denotes an expectation with respect to the distributions $Q_k(\theta_k)$ for all 
$k \neq i$. For models having suitable conjugate choices for prior distributions, the right hand 
side of (18) can be expressed as a closed-form analytic distribution. Note, however, that 
it still represents a set of coupled implicit solutions for the factors $Q_k(\theta_k)$. In practice, 
therefore, these factors are suitably initialized and are then cyclically updated using (18). 
It is worth emphasizing that, for models such as the one discussed in this paper 
for which this framework is tractable, it is also possible to calculate the lower bound 
$\mathcal{L}(Q)$ itself in closed form. Numerical evaluation of this bound during the optimization 
process allows convergence to be monitored, and can also be used for Bayesian model 
comparison since it approximates the log model probability $\ln P(D)$. It also provides a 
check on the accuracy of the mathematical solution and its numerical implementation, 
since the bound can never decrease as the result of updating one of the $Q_i$. 

3.1 Variational Solution for Bayesian PCA Mixtures 

In order to apply this framework to Bayesian PCA we assume a $Q$ distribution of the 
form 

\[ Q(S, X, \pi, W, \alpha, \mu, \tau) = Q(S) Q(X|S) Q(\pi) Q(W) Q(\alpha) Q(\mu) Q(\tau) \tag{19} \]

where $X = \{x_n\}$. The joint distribution of data and parameters is given by 

\[ P(D, \theta) = \prod_{n=1}^{N} P(t_n | x_n, W, \mu, \tau, S) \, P(X) P(S|\pi) P(\pi) P(W|\alpha) P(\alpha) P(\mu) P(\tau). \tag{20} \]

Using (19) and (20) in (18), and substituting for the various $P(\cdot)$ distributions, we 
obtain the following results for the component distributions of $Q(\cdot)$ 



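\[ Q(X|S) = \prod_{n=1}^{N} Q(x_n | s_n) \tag{21} \]

\[ Q(x_n | s_n = \delta_m) = \mathcal{N}(x_n | m_x^{(nm)}, \Sigma_x^{(m)}) \tag{22} \]

\[ Q(\mu) = \prod_{m=1}^{M} \mathcal{N}(\mu_m | m_\mu^{(m)}, \Sigma_\mu^{(m)}) \tag{23} \]

\[ Q(W) = \prod_{m=1}^{M} \prod_{k=1}^{d} \mathcal{N}(\tilde{w}_{km} | m_w^{(km)}, \Sigma_w^{(m)}) \tag{24} \]

\[ Q(\alpha) = \prod_{m=1}^{M} \prod_{i=1}^{q} \Gamma(\alpha_{mi} | \tilde{a}_\alpha, \tilde{b}_\alpha^{(mi)}) \tag{25} \]

\[ Q(\tau) = \Gamma(\tau | \tilde{a}_\tau, \tilde{b}_\tau) \tag{26} \]

\[ Q(\pi) = \mathrm{Dir}(\pi | \tilde{u}) \propto \prod_{m=1}^{M} \pi_m^{\tilde{u}_m - 1} \tag{27} \]

\[ Q(S) = \prod_{n=1}^{N} Q(s_n) \tag{28} \]

where $\tilde{w}_{km}$ denotes a column vector corresponding to the $k$th row of $W_m$. Here we have 
defined 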


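\[ m_x^{(nm)} = \langle \tau \rangle \Sigma_x^{(m)} \langle W_m^T \rangle (t_n - \langle \mu_m \rangle) \tag{29} \]

\[ \Sigma_x^{(m)} = \left( I_q + \langle \tau \rangle \langle W_m^T W_m \rangle \right)^{-1} \tag{30} \]

\[ m_\mu^{(m)} = \langle \tau \rangle \Sigma_\mu^{(m)} \sum_{n=1}^{N} \langle s_{nm} \rangle \left( t_n - \langle W_m \rangle \langle x_n | m \rangle \right) \tag{31} \]

\[ \Sigma_\mu^{(m)} = \left( \beta + \langle \tau \rangle \sum_{n=1}^{N} \langle s_{nm} \rangle \right)^{-1} I \tag{32} \]

\[ m_w^{(km)} = \langle \tau \rangle \Sigma_w^{(m)} \sum_{n=1}^{N} \langle s_{nm} \rangle \langle x_n | m \rangle \left( t_{nk} - \langle \mu_{mk} \rangle \right) \tag{33} \]

\[ \Sigma_w^{(m)} = \left( \mathrm{diag}\langle \alpha_m \rangle + \langle \tau \rangle \sum_{n=1}^{N} \langle s_{nm} \rangle \langle x_n x_n^T | m \rangle \right)^{-1} \tag{34} \]

\[ \tilde{a}_\alpha = a_\alpha + \frac{d}{2}, \qquad \tilde{b}_\alpha^{(mi)} = b_\alpha + \frac{\langle \|w_{mi}\|^2 \rangle}{2}, \qquad \tilde{a}_\tau = a_\tau + \frac{Nd}{2} \tag{35} \]

\[ \tilde{b}_\tau = b_\tau + \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{M} \langle s_{nm} \rangle \Big\{ \|t_n\|^2 + \langle \|\mu_m\|^2 \rangle + \mathrm{Tr}\left( \langle W_m^T W_m \rangle \langle x_n x_n^T | m \rangle \right) + 2 \langle \mu_m^T \rangle \langle W_m \rangle \langle x_n | m \rangle - 2 t_n^T \langle W_m \rangle \langle x_n | m \rangle - 2 t_n^T \langle \mu_m \rangle \Big\} \tag{36} \]

\[ \tilde{u}_m = u_m + \sum_{n=1}^{N} \langle s_{nm} \rangle \tag{37} \]

\[ \ln Q(s_n = \delta_m) = \langle \ln \pi_m \rangle - \frac{1}{2} \langle x_n^T x_n | m \rangle - \frac{1}{2} \langle \tau \rangle \Big\{ \|t_n\|^2 + \langle \|\mu_m\|^2 \rangle + \mathrm{Tr}\left( \langle W_m^T W_m \rangle \langle x_n x_n^T | m \rangle \right) + 2 \langle \mu_m^T \rangle \langle W_m \rangle \langle x_n | m \rangle \tag{38} \]

\[ \qquad - 2 t_n^T \langle W_m \rangle \langle x_n | m \rangle - 2 t_n^T \langle \mu_m \rangle \Big\} + \frac{1}{2} \ln |\Sigma_x^{(m)}| + \mathrm{const.} \tag{39} \]

where $\mathrm{diag}\langle \alpha_m \rangle$ denotes a diagonal matrix whose diagonal elements are given by $\{\langle \alpha_{mi} \rangle\}$, 
and $t_{nk}$ and $\mu_{mk}$ denote the $k$th elements of $t_n$ and $\mu_m$. 
The constant in $\ln Q(s_n = \delta_m)$ is found simply by summing and normalizing. Note also 
that $\langle x_n | m \rangle$ denotes an average with respect to $Q(x_n | s_n = \delta_m)$. 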

The solution for the optimal factors in the $Q(\theta)$ distribution is, of course, an implicit 
one since each distribution depends on moments of the other distributions. We can find 
a solution numerically by starting with a suitable initial guess for the distributions and 
then cycling through the groups of variables in turn, re-estimating each distribution using 
the above results. The required moments are easily evaluated using the standard results 
for normal and Gamma distributions. 
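
As a rough sketch of this cyclic re-estimation, the code below implements the single-component case ($M = 1$, so every $\langle s_{nm} \rangle$ equals 1 and the $Q(S)$ and $Q(\pi)$ factors drop out) of the updates for $Q(X)$, $Q(\mu)$, $Q(W)$, $Q(\alpha)$ and $Q(\tau)$. The function name `fit_bayesian_pca`, the initialization, the fixed iteration count, the update order, and the threshold used to count surviving columns are all assumptions made for illustration, not choices documented in the paper; the lower bound is not evaluated here.

```python
import numpy as np

def fit_bayesian_pca(T, n_iter=300, a0=1e-3, b0=1e-3, beta=1e-3, seed=0):
    """Variational Bayesian PCA for one component (illustrative sketch)."""
    N, d = T.shape
    q = d - 1                                   # maximum latent dimensionality
    rng = np.random.default_rng(seed)
    E_W = rng.normal(size=(d, q))               # row k holds the mean of Q(w~_k)
    Sigma_w = np.eye(q)                         # covariance shared by all rows of W
    E_mu = T.mean(axis=0)
    E_tau, E_alpha = 1.0, np.ones(q)
    for _ in range(n_iter):
        # Q(X): shared covariance and one mean per data point (rows of M_x)
        E_WtW = E_W.T @ E_W + d * Sigma_w       # <W^T W>
        Sigma_x = np.linalg.inv(np.eye(q) + E_tau * E_WtW)
        M_x = E_tau * (T - E_mu) @ E_W @ Sigma_x
        sum_xx = N * Sigma_x + M_x.T @ M_x      # sum_n <x_n x_n^T>
        # Q(mu): isotropic covariance s_mu * I
        s_mu = 1.0 / (beta + N * E_tau)
        E_mu = E_tau * s_mu * (T - M_x @ E_W.T).sum(axis=0)
        # Q(W)
        Sigma_w = np.linalg.inv(np.diag(E_alpha) + E_tau * sum_xx)
        E_W = E_tau * (T - E_mu).T @ M_x @ Sigma_w
        E_WtW = E_W.T @ E_W + d * Sigma_w
        # Q(alpha): diag(<W^T W>) holds <||w_i||^2> for each column i
        E_alpha = (a0 + 0.5 * d) / (b0 + 0.5 * np.diag(E_WtW))
        # Q(tau): expected sum of squared reconstruction errors
        resid = (np.sum(T ** 2) + N * (E_mu @ E_mu + d * s_mu)
                 + np.trace(E_WtW @ sum_xx)
                 + 2.0 * E_mu @ E_W @ M_x.sum(axis=0)
                 - 2.0 * np.sum((T @ E_W) * M_x)
                 - 2.0 * T.sum(axis=0) @ E_mu)
        E_tau = (a0 + 0.5 * N * d) / (b0 + 0.5 * resid)
    return E_W, E_mu, E_alpha, E_tau

# Synthetic data resembling the example of Section 2.2 (300 points, 10 dimensions).
rng = np.random.default_rng(0)
stds = np.array([1.0] * 3 + [0.5] * 7)
T = rng.normal(size=(300, 10)) * stds
E_W, E_mu, E_alpha, E_tau = fit_bayesian_pca(T)
col_norms = np.sum(E_W ** 2, axis=0)
q_eff = int(np.sum(col_norms > 0.05 * col_norms.max()))   # ad hoc threshold
```

Columns of `E_W` whose hyper-parameter in `E_alpha` grows large shrink towards zero; how many columns survive, and how quickly, depends on the initialization and the number of iterations, so `q_eff` should be read as a rough diagnostic rather than a guaranteed result.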

Our framework also permits a direct evaluation of the posterior distribution over 
the number M of components in the mixture (assuming a suitable prior distribution, 
for example a uniform distribution up to some maximum value). However, in order to 
reduce the computational complexity of the inference problem we adopt an alternative 
approach based on model comparison using the numerically evaluated lower bound $\mathcal{L}(Q)$ 
which approximates the log model probability $\ln P(D)$. Our optimization mechanism 
dynamically adapts the value of $M$ through a scheme involving the addition and deletion 
of components. 

One of the limitations of fitting conventional Gaussian mixture models by maximum 
likelihood is that there are singularities in the likelihood function in which a component’s 
mean coincides with one of the data points while its covariance shrinks to zero. Such 
singularities do not arise in the Bayesian framework due to the implicit integration over 
model parameters. 

4 Results 

In order to demonstrate the operation of the algorithm, we first explore its behaviour 
using synthetic data. The example on the left of Figure 4 shows synthetic data in two 
dimensions, together with the result of fitting a Bayesian PCA mixture model. The 
lines represent the non-zero principal directions of each component in the mixture. At 
convergence the model had 8 components, having a common effective dimensionality 
of 1. The right hand plot in Figure 4 shows synthetic data from a noisy 2-dimensional 
sphere in 3 dimensions together with the converged model, which has 12 components 
having effective dimensionality of 2. Similar results with synthetic data are robustly 
obtained when embedding low-dimensional non-linear manifolds in spaces of higher 
dimensionality. 

Fig. 4. Examples of Bayesian PCA mixture models fitted to synthetic data. 

We now apply our framework to the problem of modelling the manifold of a data set 
of face images. The data used is a combination of images from the Yale face database 
and the University of Stirling database. The training set comprises 276 training images, 
which have been cropped, subsampled to 26 x 15, and normalized pixelwise to zero 
mean and unit variance. The test set consists of a further 100 face images, together with 
200 non-face images taken from the Corel database, all of which were pre-processed in 
the same way as the training data. 

The converged Bayesian PCA mixture model has 4 components, having a common 
dimensionality of 5, as emphasized by the Hinton diagram of the shared $\alpha$ 
hyper-parameters shown in Figure 5. 




Fig. 5. Hinton diagram showing the inverses of the $\alpha$ hyper-parameters (corresponding to the 
variances of the principal components) indicating a manifold of intrinsic dimensionality 5. 



In order to see how well the model has captured the manifold we first run the model 
generatively to give some sample synthetic images, as shown in Figure 6. Synthetic faces 
generated from a single probabilistic PCA model are less noticeably distinct. 

Fig. 6. Synthetic faces obtained by running the learned mixture distribution generatively. 

We can quantify the extent to which we have succeeded in modelling the manifold
of faces by using the density model to classify the images in the test set as faces versus
non-faces. To do this we evaluate the density under the model for each test image and
if this density exceeds some threshold the image is classified as a face. The threshold
value determines the trade-off between false negatives and false positives, leading to an
ROC curve, as shown in Figure 7. For comparison we also show the corresponding ROC
curves for a single maximum likelihood PCA model for a range of different q values.
We see that moving from a linear model (PCA) to a non-linear model (a Bayesian PCA
mixture) gives a significant improvement in classification performance. This result also
highlights the fact that the Bayesian approach avoids the need to set parameter values
such as q by exhaustive exploration.

Fig. 7. ROC curves for classifying images as faces versus non-faces for the Bayesian PCA mixture
model, together with the corresponding results for a maximum likelihood PCA model with various
values of q, the number of retained principal components. This highlights the significant
improvement in classification performance in going from a linear PCA model to a non-linear
mixture model.
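
The thresholding procedure described above can be sketched in a few lines. This is only an
illustration under stated assumptions, not the code used for the experiments: it assumes that
every test image has already been given a (log-)density score by some fitted model, and the
names `roc_from_scores` and `area_under_roc` are hypothetical.

```python
import numpy as np

def roc_from_scores(face_scores, nonface_scores):
    """Sweep a threshold over density scores and return (fpr, tpr) arrays.

    An image is called a face when its score exceeds the threshold.
    """
    face = np.asarray(face_scores, dtype=float)
    nonface = np.asarray(nonface_scores, dtype=float)
    # Candidate thresholds: every observed score, highest first, padded with
    # +inf (nothing accepted) and -inf (everything accepted).
    observed = np.sort(np.concatenate((face, nonface)))[::-1]
    thresholds = np.concatenate(([np.inf], observed, [-np.inf]))
    tpr = np.array([(face > t).mean() for t in thresholds])
    fpr = np.array([(nonface > t).mean() for t in thresholds])
    return fpr, tpr

def area_under_roc(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

Lowering the threshold moves along the curve from (0, 0) to (1, 1); a model that scores every
face above every non-face gives an area of one.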

As a second application of our framework we model the manifolds of images of
handwritten digits. We use a data set taken from the CEDAR U.S. Postal Service database, and
comprising 11,000 images (equally distributed over the ten digits) each of which is 8 x 8
grayscale, together with a similar independent test set of 2711 images. Synthetic images
generated from a Bayesian PCA mixture model fitted to the training set are shown in
Figure 8.

The learned model achieved 4.83% error rate on the test set. For comparison we
note that Tipping and Bishop [?] used the same training and test sets with a maximum
likelihood mixture of probabilistic principal component analysers. The training set in
this case was itself subdivided into training plus validation sets. For each of the ten
digit models considerable computational effort was expended in finding the optimum
values of M (the number of components in the mixture) and q (the dimensionality of
the latent spaces) by evaluation of performance on the validation set. This approach
achieved 4.61% error rate on the test set, which is comparable with the result obtained
from the single run of the Bayesian PCA mixture model.

16 C.M. Bishop and J.M. Winn

Fig. 8. Digits synthesized from each of the ten trained Bayesian PCA mixture models by running
the models generatively.



5 Discussion 

In this paper we have introduced a fully probabilistic approach to modelling the manifolds
of images in which an appropriate model complexity, as well as the manifold intrinsic
dimensionality, can be inferred automatically from the data. Preliminary results on data
sets of face images and handwritten digits demonstrate both the practical feasibility of
the framework and improved performance compared to previous approaches.

An important advantage of our framework is that there are no significant adjustable
parameters in the model to be set by the user. The model complexity is inferred from
the data, and since no model optimization is required the model can be run once on the
training data, without the need for computationally intensive cross-validation.

Abstract. We present a method to learn object class models from unlabeled and
unsegmented cluttered scenes for the purpose of visual object recognition. We 
focus on a particular type of model where objects are represented as flexible 
constellations of rigid parts (features). The variability within a class is represented 
by a joint probability density function (pdf) on the shape of the constellation and 
the output of part detectors. In a first stage, the method automatically identifies 
distinctive parts in the training set by applying a clustering algorithm to patterns 
selected by an interest operator. It then learns the statistical shape model using 
expectation maximization. The method achieves very good classification results 
on human faces and rear views of cars. 



1 Introduction and Related Work 

We are interested in the problem of recognizing members of object classes, where we 
define an object class as a collection of objects which share characteristic features or 
parts that are visually similar and occur in similar spatial configurations. When building 
models for object classes of this type, one is faced with three problems (see Fig. 1).
Segmentation or registration of training images: Which objects are to be recognized
and where do they appear in the training images? Part selection: Which object parts are 
distinctive and stable? Estimation of model parameters: What are the parameters of the 
global geometry or shape and of the appearance of the individual parts that best describe 
the training data? 

Although solutions to the model learning problem have been proposed, they typically 
require that one of the first two questions, if not both, be answered by a human supervisor. 
For example, features in training images might need to be hand-labeled. Oftentimes 
training images showing objects in front of a uniform background are required. Objects 
might need to be positioned in the same way throughout the training images so that a 
common reference frame can be established. 

Amit and Geman have developed a method for visual selection which learns a
hierarchical model with a simple type of feature detector (edge elements) as its front end [?].
The method assumes that training images are registered with respect to a reference grid.
After an exhaustive search through all possible local feature detectors, a global model
is built, under which shape variability is encoded in the form of small regions in which
local features can move freely.

D. Vernon (Ed.): ECCV 2000, LNCS 1842, pp. 18 ff., 2000.

© Springer-Verlag Berlin Heidelberg 2000

Fig. 1. Which objects appear consistently in the left images, but not on the right side? Can a
machine learn to recognize instances of the two object classes (faces and cars) without any further 
information provided? 



Burl et al. have proposed a statistical model in which shape variability is modeled
in a probabilistic setting using Dryden-Mardia shape space densities [?]. Their
method requires labeled part positions in the training images.

Similar approaches to object recognition include the active appearance models of
Taylor et al., who model global deformations using Eigenspace methods, as well as
the Dynamic Link Architecture of v. der Malsburg and colleagues, who consider
deformation energy of a grid that links landmark points on the surface of objects [?]. Also
Yuille has proposed a recognition method based on gradient descent on a deformation
energy function in [?]. It is not obvious how these methods could be trained without
supervision.

The problem of automatic part selection is important, since it is generally not
established that parts that appear distinctive to the human observer will also lend themselves
to successful detection by a machine. Walker et al. address this problem in [14], albeit
outside the realm of statistical shape models. They emphasize “distinctiveness” of a part
as a criterion for selection. As we will argue below, we believe that part selection has to be
done in the context of model formation.

A completely unsupervised solution of the three problems introduced at the
beginning, in particular the first one, may seem out of reach. Intuition suggests that a good deal
of knowledge about the objects in question is required in order to know where and what
to look for in the cluttered training images. However, a solution is provided by the
expectation maximization framework which allows simultaneous estimation of unknown
data and probability densities over the same unknown data. Under this framework, all
three problems are solved simultaneously.

Another compelling reason to treat these problems jointly is the existing trade-off
between localizability and distinctiveness of parts. A very distinctive part can be a strong 
cue, even if it appears in an arbitrary location on the surface of an object — think e.g. 
of a manufacturer’s logo on a car. On the other hand, a less distinctive part can only 
contribute information if it occurs in a stable spatial relationship relative to other parts. 



2 Approach 

We model object classes following the work of Burl et al. [?]. An object is
composed of parts and shape, where ‘parts’ are image patches which may be detected and
characterized by appropriate detectors, and ‘shape’ describes the geometry of the
mutual position of the parts in a way that is invariant with respect to rigid and, possibly,
affine transformations [?]. A joint probability density on part appearance and shape
models the object class. Object detection is performed by first running part detectors on 
the image, thus obtaining a set of candidate part locations. The second stage consists 
of forming likely object hypotheses, i.e. constellations of appropriate parts (e.g. eyes, 
nose, mouth, ears); both complete and partial constellations are considered, in order to 
allow for partial occlusion. The third stage consists of using the object’s joint probability 
density for either calculating the likelihood that any hypothesis arises from an object 
(object detection), or the likelihood that one specific hypothesis arises from an object 
(object localization). In order to train a model we need to decide on the key parts of the 
object, select corresponding parts (e.g. eyes, nose etc) on a number of training images, 
and lastly we need to estimate the joint probability density function on part appearance 
and shape. Burl et al. [?] perform the first and second steps by hand, only estimating the
joint probability density function automatically. In the following, we propose methods 
for automating the first and second steps as well. 

Our technique for selecting potentially informative parts is composed of two steps
(see Fig. 2). First, small highly textured regions are detected in the training images by
means of a standard ‘interest operator’ or keypoint detector. Since our training images
are not segmented, this step will select regions of interest both in the image areas
corresponding to the training objects and on the clutter of the background. If the objects
in the training set have similar appearance then the textured regions corresponding to 
the objects will frequently be similar to each other as opposed to the textured regions 
corresponding to the background which will be mostly uncorrelated. An unsupervised 
clustering step favoring large clusters will therefore tend to select parts that correspond 
to the objects of interest rather than the background. Appropriate part detectors may be 
trained using these clusters. 

The second step of our proposed model learning algorithm chooses, out of these most 
promising parts, the most informative ones and simultaneously estimates the remaining 
model parameters. This is done by iteratively trying different combinations of a small 
number of parts. At each iteration, the parameters of the underlying probabilistic model 
are estimated. Depending on the performance of the model on a validation data set, the
choice of parts is modified. This process is iterated until the final model is obtained when
no further improvements are possible. 




Fig. 2. Block diagram of our method. “Foreground images” are images containing the target objects 
in cluttered background. “Background images” contain background only. 



Outline of the Paper In Section 3 we present our statistical object model. Section 4
discusses automatic part selection. Section 5 is dedicated to the second step of model
formation and training. Section 6 demonstrates the method through experiments on two
datasets: cars and faces.



3 Modeling Objects in Images 

Our model is based on the work by Burl et al. [?]. Important differences are that we model
the positions of the background parts through a uniform density, while they used a
Gaussian with large covariance. The probability distribution of the number of background
parts, which Burl et al. ignored, is modeled in our case as a Poisson distribution.



3.1 Generative Object Model 



We model objects as collections of rigid parts, each of which is detected by a
corresponding detector during recognition. The part detection stage therefore transforms an entire
image into a collection of parts. Some of those parts might correspond to an instance of
the target object class (the foreground), while others stem from background clutter or are
simply false detections (the background). Throughout this paper, the only information
associated with an object part is its position in the image and its identity or part type. 
We assume that there are T different types of parts. The positions of all parts extracted 
from one image can be summarized in a matrix-like form,

\[
X^o = \begin{pmatrix}
\mathbf{x}_{11}, \mathbf{x}_{12}, \ldots, \mathbf{x}_{1N_1} \\
\mathbf{x}_{21}, \mathbf{x}_{22}, \ldots, \mathbf{x}_{2N_2} \\
\vdots \\
\mathbf{x}_{T1}, \mathbf{x}_{T2}, \ldots, \mathbf{x}_{TN_T}
\end{pmatrix},
\]

where the superscript ‘o’ indicates that these positions are observable in an image, as
opposed to being unobservable or missing, which will be denoted by ‘m’. Thus, the
t-th row of $X^o$ contains the locations of detections of part type t, where every entry,
$\mathbf{x}_{ij}$, is a two-dimensional vector. If we now assume that an object is composed of F
different parts,* we need to be able to indicate which parts in $X^o$ correspond to the
foreground (the object of interest). For this we use the vector h, a set of indices, with
$h_i = j$, $j > 0$, indicating that point $\mathbf{x}_{ij}$ is a foreground point. If an object part is not
contained in $X^o$, because it is occluded or otherwise undetected, the corresponding entry
in h will be zero. When presented with an unsegmented and unlabeled image, we do not
know which parts correspond to the foreground. Therefore, h is not observable and we will
treat it as hidden or missing data. We call h a hypothesis, since we will use it to
hypothesize that certain parts of $X^o$ belong to the foreground object. It is also convenient
to represent the positions of any unobserved object parts in a separate vector $\mathbf{x}^m$ which
is, of course, hidden as well. The dimension of $\mathbf{x}^m$ will vary, depending on the number of
missed parts.

We can now define a generative probabilistic model through the joint probability
density

\[ p(X^o, \mathbf{x}^m, \mathbf{h}). \tag{1} \]

Note that not only the entries of $X^o$ and $\mathbf{x}^m$ are random variables, but also their
dimensions.

3.2 Model Details 

In order to provide a detailed parametrization of (1), we introduce two auxiliary
variables, b and n. The binary vector b encodes information about which parts have been
detected and which have been missed or occluded. Hence, $b_f = 1$ if $h_f > 0$ and $b_f = 0$
otherwise. The variable n is also a vector, where $n_t$ shall denote the number of
background candidates included in the t-th row of $X^o$. Since both variables are completely
determined by h and the size of $X^o$, we have
$p(X^o, \mathbf{x}^m, \mathbf{h}) = p(X^o, \mathbf{x}^m, \mathbf{h}, \mathbf{n}, \mathbf{b})$. Since
we assume independence between foreground and background, and, thus, between p(n)
and p(b), we decompose in the following way

\[ p(X^o, \mathbf{x}^m, \mathbf{h}, \mathbf{n}, \mathbf{b})
   = p(X^o, \mathbf{x}^m | \mathbf{h}, \mathbf{n})\, p(\mathbf{h} | \mathbf{n}, \mathbf{b})\, p(\mathbf{n})\, p(\mathbf{b}). \tag{2} \]

* To simplify notation, we only consider the case where F = T. The extension to the general
case (F > T) is straightforward.

The probability density over the number of background detections can be modeled
by a Poisson distribution,

\[ p(\mathbf{n}) = \prod_{t=1}^{T} \frac{1}{n_t!} (M_t)^{n_t} e^{-M_t}, \]

where $M_t$ is the average number of background detections of type t per image. This
conveys the assumption of independence between part types in the background and
the idea that background detections can arise at any location in the image with equal
probability, independently of other locations. For a discrete grid of pixels, p(n) should
be modeled as a binomial distribution. However, since we will model the foreground
detections over a continuous range of positions, we chose the Poisson distribution, which
is the continuous limit of the binomial distribution. Admitting a different $M_t$ for every
part type allows us to model the different detector statistics.
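
As a small worked illustration of the Poisson term, the log-probability of a vector of
background counts can be evaluated as below. This is a sketch only; the function name
`log_p_background_counts` is hypothetical, and the inputs are assumed to be one count and
one positive mean per part type.

```python
import math

def log_p_background_counts(n, M):
    """log p(n) = sum_t [ n_t log M_t - M_t - log n_t! ] for independent Poisson counts."""
    return sum(
        n_t * math.log(M_t) - M_t - math.lgamma(n_t + 1)  # lgamma(n + 1) = log n!
        for n_t, M_t in zip(n, M)
    )
```

Working in the log domain avoids underflow when the number of part types or detections
grows.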

Depending on the number of parts, F, we can model the probability p(b) either as
an explicit table (of length $2^F$) of joint probabilities, or, if F is large, as F independent
probabilities, governing the presence or absence of an individual model part. The joint
treatment could lead to a more powerful model, e.g., if certain parts are often occluded 
simultaneously. 

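The density p(h|n, b) is modeled by

\[
p(\mathbf{h} | \mathbf{n}, \mathbf{b}) =
\begin{cases}
\dfrac{1}{\prod_{f=1}^{F} N_f^{b_f}} & \mathbf{h} \in \mathcal{H}(\mathbf{b}, \mathbf{n}), \\
0 & \text{other } \mathbf{h},
\end{cases}
\]

where $\mathcal{H}(\mathbf{b}, \mathbf{n})$ denotes the set of all hypotheses consistent with b and n, and $N_f$
denotes the total number of detections of the type of part f. This expresses the fact that
all consistent hypotheses, the number of which is $\prod_{f=1}^{F} N_f^{b_f}$, are equally likely in the
absence of information on the part locations.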
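
The counting argument can be checked by brute force. The sketch below assumes the
simplified case F = T used in the text, so that part f draws its index from row f of the
detection matrix; the function names are hypothetical.

```python
import itertools
import math

def enumerate_hypotheses(N):
    """All index vectors h with h_f in {0, 1, ..., N_f}; h_f = 0 marks a missed part."""
    return list(itertools.product(*(range(N_f + 1) for N_f in N)))

def log_p_h_given_b(h, N):
    """log p(h | n, b) for a hypothesis h, with b implied by b_f = [h_f > 0]."""
    # Each detected part contributes a factor 1 / N_f; missed parts contribute 1.
    return -sum(math.log(N_f) for h_f, N_f in zip(h, N) if h_f > 0)
```

For example, with N = (2, 3) there are 12 hypotheses in total, 6 of which have both parts
detected, each with probability 1/6.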

Finally, we use

\[ p(X^o, \mathbf{x}^m | \mathbf{h}, \mathbf{n}) = p_{fg}(\mathbf{z})\, p_{bg}(\mathbf{x}_{bg}), \]

where we defined $\mathbf{z}^T = (\mathbf{x}^{oT}\, \mathbf{x}^{mT})$ as the coordinates of all foreground detections
(observed and missing) and $\mathbf{x}_{bg}$ as the coordinates of all background detections. Here
we have made the important assumption that the foreground detections are independent
of the background. In our experiments, $p_{fg}(\mathbf{z})$ is modeled as a joint Gaussian with mean
$\mu$ and covariance $\Sigma$.

Note that, so far, we have modeled only absolute part positions in the image. This
is of little use, unless the foreground object is in the same position in every image. We
can, however, obtain a translation invariant formulation of our algorithm (as used in the
experiments in this paper) by describing all part positions relative to the position of one
reference part. Under this modification, $p_{fg}$ will remain a Gaussian density, and
therefore not introduce any fundamental difficulties. However, the formulation is somewhat
intricate, especially when considering missing parts. Hence, for further discussion of
invariance the reader is referred to [?].
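
To see why the density stays Gaussian, note that subtracting the reference position is a
linear map. The sketch below covers only the case in which all parts are observed; it
assumes the coordinates are stacked as (x_1, y_1, x_2, y_2, ...), and the function name
`to_reference_frame` is hypothetical. The missing-part case discussed in the text is not
handled here.

```python
import numpy as np

def to_reference_frame(mu, Sigma, ref=0):
    """Mean and covariance of part positions relative to part `ref`.

    If z ~ N(mu, Sigma) then A z ~ N(A mu, A Sigma A^T) for any linear map A.
    """
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    F = mu.size // 2
    others = [f for f in range(F) if f != ref]
    A = np.zeros((2 * len(others), 2 * F))
    for row, f in enumerate(others):
        A[2 * row:2 * row + 2, 2 * f:2 * f + 2] = np.eye(2)       # position of part f
        A[2 * row:2 * row + 2, 2 * ref:2 * ref + 2] = -np.eye(2)  # minus the reference
    return A @ mu, A @ Sigma @ A.T
```

Shifting every part by the same offset leaves the returned mean unchanged, which is the
translation invariance referred to above.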

The positions of the background detections are modeled by a uniform density,

\[ p_{bg}(\mathbf{x}_{bg}) = \prod_{t=1}^{T} \frac{1}{A^{n_t}}, \]

where A is the total image area.

3.3 Classification 

Throughout the experiments presented in this paper, our objective is to classify images
into the classes “object present” (class $C_1$) and “object absent” (class $C_0$). Given the
observed data, $X^o$, the optimal decision, which minimizes the expected total classification
error, is made by choosing the class with maximum a posteriori probability (MAP
approach, see e.g. [?]). It is therefore convenient to consider the following ratio,

\[ \frac{p(C_1 | X^o)}{p(C_0 | X^o)} \propto \frac{\sum_{\mathbf{h}} p(X^o, \mathbf{h} | C_1)}{p(X^o, \mathbf{h}_0 | C_0)}, \]

where $\mathbf{h}_0$ denotes the null hypothesis which explains all parts as background noise.
Notice that the ratio of the priors, $p(C_1)/p(C_0)$, is omitted, since it can be absorbed into
a decision threshold. The sum in the numerator includes all hypotheses, also the null
hypothesis, since the object could be present but remain undetected by any part detector.
In the denominator, the only consistent hypothesis to explain “object absent” is the null
hypothesis.
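
Given the joint log-densities of all hypotheses, the ratio above reduces to a log-sum-exp.
The following is a sketch under the assumption that these log-densities have been computed
elsewhere; `log_likelihood_ratio` and its arguments are hypothetical names.

```python
import numpy as np

def log_likelihood_ratio(log_joint_fg, log_null_bg):
    """log [ sum_h p(X, h | C1) / p(X, h0 | C0) ].

    log_joint_fg: log p(X, h | C1) for every hypothesis h, including the null one.
    log_null_bg:  log p(X, h0 | C0) for the null hypothesis under "object absent".
    """
    a = np.asarray(log_joint_fg, dtype=float)
    m = a.max()
    # Numerically stable log-sum-exp over the hypotheses.
    return float(m + np.log(np.exp(a - m).sum()) - log_null_bg)
```

An image would be labeled “object present” when this value exceeds a chosen threshold;
sweeping the threshold traces out an ROC curve.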

Although we are here concerned with classification only, our framework is by no
means restricted to this problem. For instance, object localization is possible by identifying
those foreground parts in an image which have the highest probability of corresponding
to an occurrence of the target object.

4 Automatic Part Selection 

The problem of selecting distinctive and well localizable object parts is intimately
related to the method used to detect these parts when the recognition system is finally put
to work. Since we need to evaluate a large number of potential parts and, thus, detectors,
we settled on normalized correlation as an efficient part detection method. Furthermore,
extensive experiments led us to believe that this method offers performance comparable
to that of many more elaborate detection methods.
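
For reference, one common form of normalized correlation between an image patch and a
template of the same size is sketched below. The text does not specify the exact variant
used; the zero-mean form and the small constant `eps` are assumptions made here for
illustration.

```python
import numpy as np

def normalized_correlation(patch, template, eps=1e-12):
    """Zero-mean normalized correlation of two equally sized arrays, in [-1, 1]."""
    p = np.asarray(patch, dtype=float)
    t = np.asarray(template, dtype=float)
    p = p - p.mean()
    t = t - t.mean()
    # eps guards against division by zero for constant patches.
    return float((p * t).sum() / (np.sqrt((p ** 2).sum() * (t ** 2).sum()) + eps))
```

The score is unchanged when the patch brightness is scaled by a positive factor or shifted
by a constant, which is what makes it attractive as a simple detector response.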

With correlation based detection, every pattern in a small neighborhood in the training 
images could be used as a template for a prospective part detector. The purpose of the 
procedure described here is to reduce this potentially huge set of parts to a reasonable 
number, such that the model learning algorithm described in the next section can then 
select a few most useful parts. We use a two-step procedure to accomplish this. 

In the first step, we identify points of interest in the training images (see Fig. 3).
This is done using the interest operator proposed by Förstner [?], which is capable of
detecting corner points, intersections of two or more lines, as well as center points of 
circular patterns. This step produces about 150 part candidates per training image. 

A significant reduction of the number of parts can be achieved by the second step of 
the selection process, which consists in performing vector quantization on the patterns (a 
similar procedure was used by Leung and Malik in [?]). To this end, we use a standard
k-means clustering algorithm [?], which we tuned to produce a set of about 100 patterns.
Each of these patterns represents the center of a cluster and is obtained as the average of 
all patterns in the cluster. We only retain clusters with at least 10 patterns. We impose this 
limit, since clusters composed of very few examples tend to represent patterns which do 
not appear in a significant number of training images. Thus, we obtain parts which are 
averaged across the entire set of training images. 
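
The vector quantization step can be sketched with a plain k-means loop followed by the
minimum-cluster-size filter. This is an illustration only: the initialization, the fixed
number of iterations and the function name `cluster_patches` are assumptions, not details
taken from the text.

```python
import numpy as np

def cluster_patches(patches, k, min_size=10, n_iter=20, seed=0):
    """Cluster image patches with k-means and keep clusters with >= min_size members.

    Returns the retained cluster centers (one flattened pattern per row) and their sizes.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(patches, dtype=float).reshape(len(patches), -1)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every patch to its nearest center (squared Euclidean distance).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Each center becomes the average of the patches assigned to it.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    counts = np.bincount(labels, minlength=k)
    keep = counts >= min_size
    return centers[keep], counts[keep]
```

Discarding small clusters removes patterns that occur in only a few training images, as
argued above.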

In order to further eliminate redundancies, we remove patterns which are similar to
others after a small shift (1-2 pixels) in an arbitrary direction.

Due to the restriction to points of interest the set of remaining patterns exhibits
interesting structure, as can be seen in Figure 3. Some parts, such as human eyes, can
be readily identified. Other parts, such as simple corners, result as averages of larger
clusters, often containing thousands of patterns.

Fig. 3. Points of interest (left) identified on a training image of a human face in cluttered
background using Förstner’s method. Crosses denote corner-type patterns while circles mark
circle-type patterns. A sample of the patterns obtained using k-means clustering of small image
patches is shown for faces (center) and cars (right). The car images were high-pass filtered
before the part selection process. The total number of patterns selected were 81 for faces and
80 for cars.

This procedure dramatically reduces the number of candidate parts. However, at this 
point, parts corresponding to the background portions of the training images are still 
present. 

5 Model Learning 



In order to train an object model on a set of images, we need to solve two problems. Firstly, 
we need to decide on a small subset of the selected part candidates to be used in the model, 
i.e. define the model configuration. Secondly, we need to learn the parameters underlying 
the probability densities. We solve the first problem using an iterative, “greedy” strategy, 
under which we try different configurations. At each iteration, the pdfs are estimated 
using expectation maximization (EM). 

5.1 Greedy Model Configuration Search 

An important question to answer is with how many parts to endow our model. As the 
number of parts increases, models gain complexity and discriminatory power. It is
therefore a good strategy to start with models comprised of few parts and add parts while
monitoring the generalization error and, possibly, a criterion penalizing complexity. 

If we start the learning process with few parts, say F = 3, we are still facing the
problem of selecting the best out of $\binom{N}{F}$ possible sets of parts, where N is the number
of part candidates produced as described in Sec. 4. We do this iteratively, starting with
a random selection. At every iteration, we test whether replacing one model part with
a randomly selected one improves the model. We therefore first estimate all remaining
model parameters from the training images, as explained in the next section, and then
assess the classification performance on a validation set of positive and negative
examples. If the performance improves, the replacement part is kept. This process is stopped
when no more improvements are possible. We might then start over after increasing the
total number of parts in the model.

It is possible to render this process more efficient, in particular for models with many
parts, by prioritizing parts which have previously shown a good validation performance 
when used in smaller models. 
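
The search loop can be sketched as follows. This is a minimal illustration rather than the
procedure used in the experiments: `evaluate` stands for the (hypothetical) routine that
trains the model with the given parts and returns a validation score, and stopping after
`max_failures` consecutive unsuccessful replacements is an assumption standing in for “no
more improvements are possible”.

```python
import random

def greedy_part_search(n_candidates, n_parts, evaluate, max_failures=200, seed=0):
    """Randomized greedy search over sets of n_parts part indices."""
    if n_candidates <= n_parts:
        raise ValueError("need more candidates than model parts")
    rng = random.Random(seed)
    current = rng.sample(range(n_candidates), n_parts)  # random initial selection
    best = evaluate(current)
    failures = 0
    while failures < max_failures:
        # Replace one randomly chosen model part by a randomly chosen unused candidate.
        slot = rng.randrange(n_parts)
        unused = [c for c in range(n_candidates) if c not in current]
        trial = list(current)
        trial[slot] = rng.choice(unused)
        score = evaluate(trial)
        if score > best:   # keep the replacement only if validation performance improves
            current, best, failures = trial, score, 0
        else:
            failures += 1
    return current, best
```

In practice each call to `evaluate` involves a full EM run, so the number of trials
dominates the cost of learning.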



5.2 Estimating Model Parameters through Expectation Maximization 

We now address the problem of estimating the model pdfs with a given set of model 
parts, from a set of I training images. 

Since our detection method relies on the maximum a posteriori probability (MAP)
principle, it is our goal to model the class conditional densities as accurately as possible.
We therefore employ the expectation maximization (EM) algorithm to produce maximum
likelihood estimates of the model parameters, $\theta = \{\mu, \Sigma, p(\mathbf{b}), \mathbf{M}\}$. EM is well suited
for our problem, since the variables h and $\mathbf{x}^m$ are missing and must be inferred from the
observed data, $\{X_i^o\}$. In standard EM fashion, we proceed by maximizing the likelihood
of the observed data,

\[ L(\{X_i^o\} | \theta) = \sum_{i=1}^{I} \log \sum_{\mathbf{h}_i} \int p(X_i^o, \mathbf{x}_i^m, \mathbf{h}_i | \theta)\, d\mathbf{x}_i^m, \]

with respect to the model parameters. Since this is difficult to achieve in practice, EM
iteratively maximizes a sequence of functions,

\[ Q(\tilde{\theta} | \theta) = \sum_{i=1}^{I} E[\log p(X_i^o, \mathbf{x}_i^m, \mathbf{h}_i | \tilde{\theta})], \]

where E[.] refers to the expectation with respect to the posterior $p(\mathbf{h}_i, \mathbf{x}_i^m | X_i^o, \theta)$.
Throughout this section, a tilde denotes parameters we are optimizing for, while no
tilde implies that the values from the previous iteration are substituted. EM theory [?]
guarantees that subsequent maximization of the $Q(\tilde{\theta} | \theta)$ converges to a local maximum
of L.

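We now derive update rules that will be used in the M-step of the EM algorithm. The
parameters we need to consider are those of the Gaussian governing the distribution of
the foreground parts, i.e. $\mu$ and $\Sigma$, the table representing p(b) and the parameter, M,
governing the background densities. It will be helpful to decompose Q into four parts,
following the factorization in Equation (2).

\[
\begin{aligned}
Q(\tilde{\theta} | \theta) &= Q_1(\tilde{\theta} | \theta) + Q_2(\tilde{\theta} | \theta) + Q_3(\tilde{\theta} | \theta) + Q_4 \\
&= \sum_{i=1}^{I} E[\log p(\mathbf{n}_i | \tilde{\theta})]
 + \sum_{i=1}^{I} E[\log p(\mathbf{b}_i | \tilde{\theta})]
 + \sum_{i=1}^{I} E[\log p(X_i^o, \mathbf{x}_i^m | \mathbf{h}_i, \mathbf{n}_i, \tilde{\theta})] \\
&\quad + \sum_{i=1}^{I} E[\log p(\mathbf{h}_i | \mathbf{n}_i, \mathbf{b}_i)]
\end{aligned}
\]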
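Only the first three terms depend on parameters that will be updated during EM.

Update rule for $\mu$ Since only $Q_3$ depends on $\tilde{\mu}$, taking the derivative of the expected
likelihood yields

\[ \frac{\partial}{\partial \tilde{\mu}} Q_3(\tilde{\theta} | \theta) = \sum_{i=1}^{I} E[\tilde{\Sigma}^{-1} (\mathbf{z}_i - \tilde{\mu})], \]

where $\mathbf{z}^T = (\mathbf{x}^{oT}\, \mathbf{x}^{mT})$ according to our definition above. Setting the derivative to zero
yields the following update rule

\[ \tilde{\mu} = \frac{1}{I} \sum_{i=1}^{I} E[\mathbf{z}_i]. \]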

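Update rule for $\Sigma$ Similarly, we obtain for the derivative with respect to the inverse
covariance matrix

\[ \frac{\partial}{\partial \tilde{\Sigma}^{-1}} Q_3(\tilde{\theta} | \theta)
   = \sum_{i=1}^{I} E\left[ \tfrac{1}{2} \tilde{\Sigma} - \tfrac{1}{2} (\mathbf{z}_i - \tilde{\mu})(\mathbf{z}_i - \tilde{\mu})^T \right]. \]

Equating with zero leads to

\[ \tilde{\Sigma} = \frac{1}{I} \sum_{i=1}^{I} E[(\mathbf{z}_i - \tilde{\mu})(\mathbf{z}_i - \tilde{\mu})^T]
   = \frac{1}{I} \sum_{i=1}^{I} E[\mathbf{z}_i \mathbf{z}_i^T] - \tilde{\mu} \tilde{\mu}^T. \]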
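
The two Gaussian updates amount to averaging the per-image expected statistics. A minimal
sketch, assuming the E-step has already produced $E[\mathbf{z}_i]$ and $E[\mathbf{z}_i \mathbf{z}_i^T]$ for
every training image (the function name `m_step_gaussian` is hypothetical):

```python
import numpy as np

def m_step_gaussian(Ez, Ezz):
    """M-step for the foreground Gaussian.

    Ez:  array of shape (I, d), expected foreground vectors E[z_i].
    Ezz: array of shape (I, d, d), expected outer products E[z_i z_i^T].
    """
    Ez = np.asarray(Ez, dtype=float)
    Ezz = np.asarray(Ezz, dtype=float)
    mu = Ez.mean(axis=0)                           # mu = (1/I) sum_i E[z_i]
    Sigma = Ezz.mean(axis=0) - np.outer(mu, mu)    # (1/I) sum_i E[z_i z_i^T] - mu mu^T
    return mu, Sigma
```

When nothing is missing, $E[\mathbf{z}_i] = \mathbf{z}_i$ and the result reduces to the ordinary sample mean
and (biased) sample covariance.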



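Update rule for p(b) To find the update rule for the $2^F$ probability masses of p(b),
we need to consider $Q_2(\tilde{\theta} | \theta)$, the only term depending on these parameters. Taking the
derivative with respect to $\tilde{p}(\bar{\mathbf{b}})$, the probability of observing one specific vector, $\bar{\mathbf{b}}$, we
obtain

\[ \frac{\partial}{\partial \tilde{p}(\bar{\mathbf{b}})} Q_2(\tilde{\theta} | \theta)
   = \sum_{i=1}^{I} \frac{E[\delta_{\mathbf{b}, \bar{\mathbf{b}}}]}{\tilde{p}(\bar{\mathbf{b}})}, \]

where $\delta$ shall denote the Kronecker delta. Imposing the constraint
$\sum_{\bar{\mathbf{b}}} \tilde{p}(\bar{\mathbf{b}}) = 1$, for instance by adding a Lagrange multiplier term, we find the
following update rule for $\tilde{p}(\bar{\mathbf{b}})$,

\[ \tilde{p}(\bar{\mathbf{b}}) = \frac{1}{I} \sum_{i=1}^{I} E[\delta_{\mathbf{b}, \bar{\mathbf{b}}}]. \]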



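Update rule for M Finally, we notice that $Q_1(\tilde{\theta} | \theta)$ is the only term containing
information about the mean number of background points per part type $M_t$. Differentiating
$Q_1(\tilde{\theta} | \theta)$ with respect to $\tilde{\mathbf{M}}$ we find

\[ \frac{\partial}{\partial \tilde{\mathbf{M}}} Q_1(\tilde{\theta} | \theta)
   = \sum_{i=1}^{I} \frac{E[\mathbf{n}_i]}{\tilde{\mathbf{M}}} - I. \]

Equating with zero leads to the intuitively appealing result

\[ \tilde{\mathbf{M}} = \frac{1}{I} \sum_{i=1}^{I} E[\mathbf{n}_i]. \]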
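
Both of these updates are again plain averages over the training images. The sketch below
assumes the E-step supplies, for every image, the posterior probability of each of the
$2^F$ detection patterns and the expected background counts per part type; the function
and argument names are hypothetical.

```python
import numpy as np

def m_step_presence_and_background(post_b, exp_n):
    """M-step for the table p(b) and the Poisson means M.

    post_b: array of shape (I, 2**F); row i holds E[delta_{b, b_bar}] for every b_bar.
    exp_n:  array of shape (I, T); row i holds E[n_i], the expected background counts.
    """
    post_b = np.asarray(post_b, dtype=float)
    exp_n = np.asarray(exp_n, dtype=float)
    p_b = post_b.mean(axis=0)   # p(b_bar) = (1/I) sum_i E[delta_{b, b_bar}]
    M = exp_n.mean(axis=0)      # M = (1/I) sum_i E[n_i]
    return p_b, M
```

Because every row of `post_b` sums to one, the updated table is automatically normalized.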
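Computing the Sufficient Statistics All update rules derived above are expressed in
terms of sufficient statistics, $E[\mathbf{z}]$, $E[\mathbf{z}\mathbf{z}^T]$, $E[\delta_{\mathbf{b}, \bar{\mathbf{b}}}]$ and $E[\mathbf{n}]$, which are calculated in
the E-step of the EM algorithm. We therefore consider the posterior density,

\[ p(\mathbf{h}_i, \mathbf{x}_i^m | X_i^o, \theta)
   = \frac{p(\mathbf{h}_i, \mathbf{x}_i^m, X_i^o | \theta)}
          {\sum_{\mathbf{h}_i} \int p(\mathbf{h}_i, \mathbf{x}_i^m, X_i^o | \theta)\, d\mathbf{x}_i^m}. \]
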
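The denominator in the expression above, which is equal to $p(X^o)$, is calculated by
explicitly generating and summing over all hypotheses, while integrating out† the missing
data of each hypothesis. The expectations are calculated in a similar fashion:
$E[\delta_{\mathbf{b}, \bar{\mathbf{b}}}]$ is calculated by summing only over those hypotheses consistent with $\bar{\mathbf{b}}$ and
dividing by $p(X^o)$. Similarly, $E[\mathbf{n}]$ is calculated by averaging n(h) over all hypotheses.
The case of $E[\mathbf{z}]$ is slightly more complicated. For every hypothesis we regroup
$\mathbf{z}^T = (\mathbf{x}^{oT}\, \mathbf{x}^{mT})$ and note that $E[\mathbf{x}^o] = \mathbf{x}^o$. For $E[\mathbf{x}^m]$ one needs to calculate

\[ \int \mathbf{x}^m\, G(\mathbf{z} | \mu, \Sigma)\, d\mathbf{x}^m
   = \mu^m + \Sigma^{mo} (\Sigma^{oo})^{-1} (\mathbf{x}^o - \mu^o), \]

where we defined

\[ \mu = \begin{pmatrix} \mu^o \\ \mu^m \end{pmatrix}, \qquad
   \Sigma = \begin{pmatrix} \Sigma^{oo} & \Sigma^{om} \\ \Sigma^{mo} & \Sigma^{mm} \end{pmatrix}. \]
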


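Summing over all hypotheses and dividing by $p(X^o)$ establishes the result. Finally we
need to calculate

\[ E[\mathbf{z}\mathbf{z}^T] = \begin{pmatrix}
\mathbf{x}^o \mathbf{x}^{oT} & \mathbf{x}^o E[\mathbf{x}^m]^T \\
E[\mathbf{x}^m] \mathbf{x}^{oT} & E[\mathbf{x}^m \mathbf{x}^{mT}]
\end{pmatrix}. \]

Here, only the part $E[\mathbf{x}^m \mathbf{x}^{mT}]$ has not yet been considered. Integrating out the missing
dimensions, $\mathbf{x}^m$, now involves

\[ \int \mathbf{x}^m \mathbf{x}^{mT}\, G(\mathbf{z} | \mu, \Sigma)\, d\mathbf{x}^m
   = \Sigma^{mm} - \Sigma^{mo} (\Sigma^{oo})^{-1} \Sigma^{moT} + E[\mathbf{x}^m] E[\mathbf{x}^m]^T. \]

Looping through all possible hypotheses and dividing by $p(X^o)$ again provides the
desired result. This concludes the E-step of the EM algorithm.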
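
For a single hypothesis, the two moments of the missing coordinates are the standard
conditional mean and second moment of a partitioned Gaussian. The sketch below computes
them for given index sets of observed and missing dimensions; the function name and the
index-list interface are assumptions made for illustration, and the weighting by the
hypothesis posterior is left out.

```python
import numpy as np

def missing_part_moments(x_o, mu, Sigma, obs, mis):
    """E[x^m] and E[x^m x^mT] given the observed coordinates x_o.

    obs, mis: lists of indices of the observed and missing dimensions of z.
    """
    x_o = np.asarray(x_o, dtype=float)
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    S_oo = Sigma[np.ix_(obs, obs)]
    S_mo = Sigma[np.ix_(mis, obs)]
    S_mm = Sigma[np.ix_(mis, mis)]
    # gain = S_mo S_oo^{-1}, computed with a linear solve (S_oo is symmetric).
    gain = np.linalg.solve(S_oo, S_mo.T).T
    E_xm = mu[mis] + gain @ (x_o - mu[obs])
    E_xmxm = S_mm - gain @ S_mo.T + np.outer(E_xm, E_xm)
    return E_xm, E_xmxm
```

For instance, with $\mu = (0, 0)$, $\Sigma = ((1, 0.5), (0.5, 2))$ and an observed first
coordinate of 2, the conditional mean of the second coordinate is 1 and its second moment
is 1.75 + 1 = 2.75.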



6 Experiments 

In order to validate our method, we tested the performance, under the classification
task described in Sect. 3.3, on two data sets: images of rear views of cars and images
of human faces. As mentioned in Sec. 3.2, the experiments described below have been
performed with a translation invariant extension of our learning method. All parameters
of the learning algorithm were set to the same values in both experiments.

† Integrating out dimensions of a Gaussian is simply done by deleting the means and covariances
of those dimensions and multiplying by the suitable normalization constant.



Unsupervised Learning of Models for Recognition 



29 





Model Performance 



Fig. 4. Results of the learning experiments. On the left we show the best performing car model 
with four parts. The selected parts are shown on top. Below, ellipses indicating a one-std deviation 
distance from the mean part positions, according to the foreground pdf have been superimposed 
on a typical test image. They have been aligned by hand for illustrative purposes, since the models 
are translation invariant. In the center we show the best four-part face model. The plot on the right 
shows average training and testing errors measured as $1 - A_{ROC}$, where $A_{ROC}$ is the area under
the corresponding ROC curve. For both models, one observes moderate overfitting. For faces, the 
smallest test error occurs at 4 parts. Hence, for the given amount of training data, this is the optimal 
number of parts. For cars, 5 or more parts should be used. 



Training and Test Images. For each of the two object classes we took 200 images
showing a target object at an arbitrary location in cluttered background (Fig. 1, left). We
also took 200 images of background scenes from the same environment, excluding the
target object (Fig. 1, right). No images were discarded by hand prior to the experiments.
The face images were taken indoors as well as outdoors and contained 30 different people
(male and female). The car images were taken on public streets and parking lots where
we photographed vehicles of different sizes, colors and types, such as sedans, sport utility
vehicles, and pick-up trucks. The car images were high-pass filtered in order to promote
invariance with respect to the different car colors and lighting conditions. All images
were taken with a digital camera; they were converted to a grayscale representation and
downsampled to a resolution of 240 x 160 pixels.

Each image set was randomly split into two disjoint sets of training and test images. 
In the face experiment, no single person was present in both sets. 




Fig. 5. Multiple use of parts: The three-part model on the left correctly classified the four images 
on the right. Part labels are: ○ = ‘A’, □ = ‘B’, ◇ = ‘C’. Note that the middle part (C) exhibits
a high variance along the vertical direction. It matches several locations in the images, such as 
the bumper, license plate and roof. In our probabilistic framework, no decision is made as to the 
correct match. Rather, evidence is accumulated across all possible matches.

Automatically Selected Parts. Parts were automatically selected according to the procedure
described earlier. The Förstner interest operator was applied to the 100 unlabeled
and unsegmented training images containing instances of the target object class. We
performed vector quantization on grayscale patches of size 11 x 11 pixels, extracted
around the points of interest. A different set of patterns was produced for each object
class.
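To make the vector quantization step more tangible, here is a minimal sketch using plain k-means. That k-means is the quantizer, the choice of the first k vectors as initial centers, and the tiny two-dimensional "patches" are all assumptions made for illustration; real patches would be 121-dimensional vectors (11 x 11 pixels).

```python
def kmeans(vectors, k, iterations=10):
    """Plain k-means; the first k vectors serve as initial centers."""
    centers = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest center in squared Euclidean distance.
        for n, v in enumerate(vectors):
            labels[n] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])))
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


# Hypothetical flattened patches forming two obvious clusters.
patches = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
centers, labels = kmeans(patches, k=2)
print(labels)
```

Each resulting center plays the role of one candidate part pattern; the learning stage then decides which of these candidates are useful.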

Model Learning. We learned models with 2, 3, 4, and 5 parts for both data sets. Since the
greedy configuration search as well as the EM algorithm can potentially converge to local
extrema, we learned each model up to 100 times, recording the average classification
error.

All models were learned from the entire set of selected parts. Hence, no knowledge
from the training of small models about the usefulness of parts was applied during the
training of the larger models. This was done in order to investigate to what extent the
same parts were chosen across model sizes.

We found the EM algorithm to converge in about 100 iterations, which corresponds
to less than 10 s for a model with two parts and about 2 min for a five-part model. We
used a Matlab implementation with subroutines written in C and a PC with a 450 MHz
Pentium II processor. The number of different part configurations evaluated varied from
about 80-150 (2 parts) to 300-400 (5 parts).

Results. Instead of classifying every image by applying a fixed decision threshold,
we computed receiver operating characteristics (ROCs) based on the ratio of
posterior probabilities. In order to reduce sensitivity to noise due to the limited number
of training images and to average across all possible values for the decision threshold,
we used the area under the ROC curve as a measure of the classification performance
driving the optimization of the model configuration. In Figure 4 we show two learned
models as well as this error measure as a function of the number of parts.
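The error measure $1 - A_{ROC}$ can be sketched in a few lines. The example below relies on the standard equivalence between the area under the ROC curve and the fraction of (foreground, background) pairs that the score ranks correctly. The function name `roc_area` and the scores are made up for illustration; in the experiments the score would be the ratio of posterior probabilities.

```python
def roc_area(foreground_scores, background_scores):
    """Area under the ROC curve as the fraction of (foreground, background)
    pairs in which the foreground image scores higher; ties count one half."""
    wins = 0.0
    for f in foreground_scores:
        for b in background_scores:
            if f > b:
                wins += 1.0
            elif f == b:
                wins += 0.5
    return wins / (len(foreground_scores) * len(background_scores))


# Hypothetical detector scores for three foreground and three background images.
fg = [0.9, 0.8, 0.4]
bg = [0.7, 0.3, 0.2]
error = 1.0 - roc_area(fg, bg)
print(error)
```

Because the area averages over all thresholds, no single operating point has to be chosen while the model configuration is being optimized.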

Examples of successfully and wrongly classified images from the test sets are shown
in Fig. 6.

When inspecting the models produced, we were able to make several interesting 
observations. For example, in the case of faces, we found confirmation that eye corners
are very good parts. But our intuition was not always correct. Features along the hairline 
turned out to be very stable, while parts containing noses were almost never used in the 
models. 

Before we introduced a high-pass filter as a preprocessing step, the car models 
concentrated on the dark shadow underneath the cars as the most stable feature. Researchers
familiar with the problem of tracking cars on freeways confirmed that the shadow is often 
the easiest way to detect a car. 

Oftentimes the learning algorithm took advantage of the fact that some part detectors 
respond well at multiple locations on the target objects (Fig. 5). This effect was most
pronounced for models with few parts. It would be difficult to predict and exploit this 
behavior when building a model “by hand.” 

Since we ran the learning process many times, we were able to assess the likelihood
of converging to local extrema. For each size, models with different part choices were
produced. However, each choice was produced at least a few times. Regarding the EM
algorithm itself, we observed only one instance in which a given choice of parts resulted
in several different classification performances. This leads us to conclude that the EM
algorithm is extremely unlikely to get stuck in a local maximum.

Fig. 6. Examples of correctly and incorrectly classified images from the test sets, based on the
models in Fig. 4. Part labels are: ○ = ‘A’, □ = ‘B’, ◇ = ‘C’, ▽ = ‘D’. 100 foreground and 100
background images were classified in each case. The decision threshold was set to yield equal
error rate on foreground and background images. In the case of faces, 93.5% of all images were
classified correctly, compared to 86.5% in the more difficult car experiment.

Upon inspection of the different part types selected across model sizes, we noticed 
that about half of all parts chosen at a particular model size were also present in smaller 
models. This suggests that initializing the choice of parts with parts found in well-performing
smaller models is a good strategy. However, one should still allow the algorithm
to also choose from parts not used in smaller models. 



7 Discussion and Future Work 

We have presented ideas for learning object models in an unsupervised setting. A set 
of unsegmented and unlabeled images containing examples of objects amongst clutter 
is supplied; our algorithm automatically selects distinctive parts of the object class, 
and learns the joint probability density function encoding the object’s appearance. This 
allows the automatic construction of an efficient object detector which is robust to clutter 
and occlusion. 

We have demonstrated that our model learning algorithm works successfully on two 
different data sets: frontal views of faces and rear views of motor-cars. In the case of 
faces, discrimination of images containing the desired object vs. background images 
exceeds 90% correct with simple models composed of 4 parts. Performance on cars is 
87% correct. While training is computationally expensive, detection is efficient, requiring 
less than a second in our C-Matlab implementation. This suggests that training should 
be seen as an off-line process, while detection may be implemented in real-time. 

The main goal of this paper is to demonstrate that it is feasible to learn object models
directly from unsegmented cluttered images, and to provide ideas on how one may do so.
Many aspects of our implementation are suboptimal and open to improvement. To
list a few: we implemented the part detectors using normalized correlation. More sophisticated
detection algorithms, involving multiscale image processing, multiorientation-multiresolution
filters, neural networks etc. should be considered and tested. Moreover,
in our current implementation only part of the information supplied by the detectors,
i.e. the candidate part’s location, is used; the scale and orientation of the image patch,
parameters describing the appearance of the patch, as well as its likelihood, should be
incorporated. Our interest operator as well as the unsupervised clustering of the parts
have not been optimized in any respect; the choice of the algorithms deserves further
scrutiny as well. An important aspect where our implementation falls short of generality
is invariance: the models we learned and tested are translation invariant, but not rotation,
scale or affine invariant. While there is no conceptual limit to this generalization, the
straightforward implementation of the EM algorithm in the rotation and scale invariant
case is slow, and therefore impractical for extensive experimentation.

Acknowledgements. This work was funded by the NSF Engineering Research Center for
Neuromorphic Systems Engineering (CNSE) at Caltech (NSF9402726), and an NSF National Young
Investigator Award to P.P. (NSF9457618). M. Welling was supported by the Sloan Foundation.

We are also very grateful to Rob Fergus for helping with collecting the databases and to
Thomas Leung, Mike Burl, Jitendra Malik and David Forsyth for many helpful comments.

References 

1. Y. Amit and D. Geman. A computational model for visual selection. Neural Computation,
11(7):1691-1715, 1999.

2. M.C. Burl, T.K. Leung, and P. Perona. Face localization via shape statistics. In Int.
Workshop on Automatic Face and Gesture Recognition, 1995.

3. M.C. Burl, T.K. Leung, and P. Perona. Recognition of planar object classes. In Proc.
IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 1996.

4. M.C. Burl, M. Weber, and P. Perona. A probabilistic approach to object recognition using
local photometry and global geometry. In Proc. ECCV’98, pages 628-641, 1998.

5. T.F. Cootes and C.J. Taylor. Locating objects of varying shape using statistical feature
detectors. In European Conf. on Computer Vision, pages 465-474, 1996.

6. A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM
algorithm. Journal of the Royal Statistical Society B, 39:1-38, 1977.

7. R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. John Wiley and Sons,
Inc., 1973.

8. G.J. Edwards, T.F. Cootes, and C.J. Taylor. Face recognition using active appearance models.
In Proc. Europ. Conf. Comput. Vision, H. Burkhardt and B. Neumann (Eds.), LNCS-Series
Vol. 1406-1407, Springer-Verlag, pages 581-595, 1998.

9. R.M. Haralick and L.G. Shapiro. Computer and Robot Vision II. Addison-Wesley, 1993.

10. M. Lades, J.C. Vorbrüggen, J. Buhmann, J. Lange, C. v.d. Malsburg, R.P. Würtz, and W. Konen.
Distortion invariant object recognition in the dynamic link architecture. IEEE Trans.
Comput., 42(3):300-311, Mar 1993.

11. T.K. Leung, M.C. Burl, and P. Perona. Finding faces in cluttered scenes using random
labeled graph matching. In Proc. 5th Int. Conf. Computer Vision, pages 637-644, June 1995.

12. T.K. Leung, M.C. Burl, and P. Perona. Probabilistic affine invariants for recognition. In Proc.
IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., pages 678-684, 1998.

13. T.K. Leung and J. Malik. Recognizing surfaces using three-dimensional textons. In Proc. 7th
Int. Conf. Computer Vision, pages 1010-1017, 1999.

14. K.N. Walker, T.F. Cootes, and C.J. Taylor. Locating salient facial features. In Int. Conf. on
Automatic Face and Gesture Recognition, Nara, Japan, 1998.

15. A.L. Yuille. Deformable templates for face recognition. J. of Cognitive Neurosci., 3(1):59-70,
1991.




Learning over Multiple Temporal Scales in 
Image Databases 



Nuno Vasconcelos and Andrew Lippman 



MIT Media Laboratory, 

20 Ames St. E15-354, Cambridge, USA 
{nuno, lip}@media.mit.edu
http://www.media.mit.edu/~nuno



Abstract. The ability to learn from user interaction is an important asset
for content-based image retrieval (CBIR) systems. Over short time
scales, it enables the integration of information from successive queries,
assuring faster convergence to the desired target images. Over long time
scales (retrieval sessions) it allows the retrieval system to tailor itself
to the preferences of particular users. We address the issue of learning
by formulating retrieval as a problem of Bayesian inference. The new
formulation is shown to have various advantages over previous approaches:
it leads to the minimization of the probability of retrieval error,
enables region-based queries without prior image segmentation, and suggests
elegant procedures for combining multiple user specifications. As
a consequence of all this, it enables the design of short and long-term
learning mechanisms that are simple, intuitive, and extremely efficient
in terms of computational and storage requirements. We introduce two
such algorithms and present experimental evidence illustrating the clear
advantages of learning for CBIR.



1 Introduction 

Due to the large amounts of imagery that can now be accessed and managed via 
computers, the problem of CBIR has recently attracted significant interest from 
the vision community. As an application domain, CBIR poses new challenges 
for machine vision: since very few assumptions about the scenes to be analyzed 
are allowable, the only valid representations are those of a generic nature (and 
typically of low-level) and image understanding becomes even more complex 
than when stricter assumptions hold. Furthermore, large quantities of imagery 
must be processed, both off-line for database indexing and on-line for similarity 
evaluation, limiting the amount of processing per image that can be devoted to 
each stage. 

On the other hand, CBIR systems have access to feedback from human users
that can be exploited to simplify the task of finding the desired images. This is a
major departure from most previous vision applications and makes it feasible to
build effective systems without having to solve the complete AI problem. In fact,
a retrieval system is nothing more than an interface between an intelligent high-level
system (the user’s brain) that can perform amazing feats in terms of visual
interpretation but is limited in speed, and a low-level system (the computer)
that has very limited visual abilities but can perform low-level operations very
efficiently. Therefore, the more successful retrieval systems will be those that
make the user-machine interaction easier.

The goal is to exploit as much as possible the strengths of the two players: 
the user can provide detailed feedback to guide the search when presented with a 
small set of meaningful images, the machine can rely on that feedback to quickly 
find the next best set of such images. To enable convergence to the desired target 
image, the low-level retrieval system cannot be completely dumb, but must know 
how to integrate all the information provided to it by the user over the entire 
course of interaction. If this were not the case it would simply keep oscillating 
between the image sets that best satisfied the latest indication from above, and 
convergence to the right solution would be difficult. 

This ability to learn by integrating information must occur over various
time scales. Some components may be hard-coded into the low-level system from
the start, e.g. the system may contain a specialized face-recognition module
and therefore know how to recognize faces. Hard-coded modules are justifiable
only for visual concepts that are likely to be of interest to most users. Most
components should instead be learned over time, as different users will need to
rely on retrieval systems that are suited for their tastes and personalities. While
for some users, e.g. bird lovers, it may be important to know how to recognize
parrots, others could not care less about them. Fortunately, users interested in
particular visual concepts will tend to search for those concepts quite often and
there will be plenty of examples to learn from. Hence, the retrieval system can
build internal concept representations and become progressively more apt at
recognizing them as time progresses. We refer to such mechanisms as long-term
learning or learning between retrieval sessions, i.e. learning that does not have
to occur on-line, or even in the presence of the user.

Information must also be integrated over short time scales, e.g. during a
particular retrieval session. In the absence of short-term or in-session learning,
the user would have to keep repeating the information provided to the retrieval
system from iteration to iteration. This would be cumbersome and extremely
inefficient, since a significant portion of the computation performed by the latter
would simply replicate what had been done in previous iterations. Unlike long-term
learning, short-term learning must happen on-line and therefore has to be
fast.

In this paper we address the issue of learning in image databases by formulating
image retrieval as a problem of Bayesian inference. This new formulation
is shown to have various interesting properties. First, it provides the optimal
solution to a meaningful and objective criterion for performance evaluation, the
minimization of the retrieval error. Second, the complexity of a given query is
a function only of the number of attributes specified in that query and not of
the total number of attributes known by the system, which can therefore be
virtually unlimited. Third, when combined with generative probabilistic representations
for visual information it enables region-based queries without prior
image segmentation. Fourth, information from multiple user specifications can
be naturally integrated through belief propagation according to the laws of probability.
This not only allows the retrieval operation to take into account multiple
content modalities (e.g. text, audio, video, etc.) but is shown to lead to
optimal integration algorithms that are extremely simple, intuitive, and easy
to implement. As a result, it becomes relatively easy to design solutions for both
short and long-term learning. We introduce two such mechanisms, and present
experimental evidence that illustrates the clear benefits of learning for CBIR.



2 Prior Work 



Even though the learning ability of a retrieval system is determined to a significant
extent by its image representation, the overwhelming majority of the work
in CBIR has been devoted to the design of the latter without much consideration
of its impact on the former. In fact, only a small subset of CBIR papers addresses
the learning issue at all [1, 2, 4, 6, 7], and even these are usually devoted to the
issue of short-term learning (also known as relevance feedback).

Two of the most interesting proposals for learning in CBIR, the “Four eyes”
and “PicHunter” systems, are Bayesian in spirit. “Four eyes” pre-segments
all the images in the database, and groups all the resulting regions. Learning 
consists of finding the groupings that maximize the product of the number of 
examples provided by the user with a prior grouping weight. “PicHunter” defines 
a set of actions that a user may take and, given the images retrieved at a given 
point, tries to estimate the probabilities of the actions the user will take next. 
Upon observation of these actions, Bayes rule gives the probability of each image 
in the database being the target. 

Because, in both of these systems, the underlying image representations and
similarity criteria are not conducive to learning per se, they lead to solutions
that are not completely satisfying. For example, because there is no easy way to
define priors for region groupings, in “Four eyes” this is done through a greedy algorithm
based on heuristics that are not always easy to justify or guaranteed to lead to
an interesting solution. On the other hand, because user modeling is a difficult
task, “PicHunter” relies on several simplifying assumptions and heuristics to estimate
action probabilities. These estimates can only be obtained through an ad-hoc
function of image similarity which is hard to believe valid for all or even most
of the users the system will encounter. Indeed it is not even clear that such a
function can be derived when the action set becomes more complicated than
that supported by the simple interface of “PicHunter”.

All these problems are eliminated by our formulation, where all inferences
are drawn directly from the observation of the image regions selected by the
user. We show that by combining a probabilistic criterion for image similarity
with a generative model for image representation there is no need for heuristic
algorithms to learn priors or heuristic functions relating image similarity and
the belief that a given image is the target. Under the new formulation, 1) the
similarity function is, by definition, that belief and 2) prior learning follows naturally
from belief propagation according to the laws of probability. Since all
the necessary beliefs are an automatic outcome of the similarity evaluation and
all previous interaction can be summarized in a small set of prior probabilities,
this belief propagation is very simple, intuitive, and extremely efficient from the
points of view of computation and storage.



3 Retrieval as Bayesian Inference 

The retrieval problem is naturally formulated as one of statistical classification.
Given a representation space T for the entries in the database, the design of a
retrieval system consists of finding a map $g$
from T to the set M of classes identified as useful for the retrieval operation.

In our work, we set as the goal of content-based retrieval the minimization of the
probability of retrieval error, i.e. the probability $P(g(X) \neq y)$ that, if the user provides
the retrieval system with a query X drawn from class y, the system will return
images from a class g(X) different from y. Once the problem is formulated in
this way, it is well known that the optimal map is the Bayes classifier

$$g^*(X) = \arg\max_i P(y = i|X) \tag{1}$$

$$\phantom{g^*(X)} = \arg\max_i \{P(X|y = i) P(y = i)\}, \tag{2}$$

where $P(X|y = i)$ is the likelihood function for the $i$-th class and $P(y = i)$ the
prior probability for this class. In the absence of prior information about which
class is most suited for the query, an uninformative prior can be used and the
optimal decision is the maximum likelihood (ML) criterion

$$g^*(X) = \arg\max_i P(X|y = i). \tag{3}$$
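The decision rules (2) and (3) amount to an argmax over per-class scores. The following is a minimal sketch, assuming the class-conditional log-likelihoods have already been computed by some model. The function name `bayes_classify` and the numbers are hypothetical.

```python
import math


def bayes_classify(log_likelihoods, priors=None):
    """Return the index i maximizing log P(X|y=i) + log P(y=i).

    With priors=None a uniform prior is used, which reduces the rule
    to the maximum likelihood criterion. All priors must be positive.
    """
    n = len(log_likelihoods)
    if priors is None:
        priors = [1.0 / n] * n
    scores = [ll + math.log(p) for ll, p in zip(log_likelihoods, priors)]
    return max(range(n), key=lambda i: scores[i])


# Made-up likelihoods of one query under three image classes.
log_lik = [math.log(0.2), math.log(0.5), math.log(0.3)]
ml_class = bayes_classify(log_lik)                            # uniform prior
map_class = bayes_classify(log_lik, priors=[0.8, 0.1, 0.1])   # informative prior
print(ml_class, map_class)
```

Working with logarithms leaves the argmax unchanged and avoids numerical underflow when likelihoods are products of many small terms.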



3.1 Probabilistic Model 

To define a probabilistic model for the observed data, we assume that each observation
X is composed of A attributes $X = \{X^{(1)}, \ldots, X^{(A)}\}$ which, although
marginally dependent, are independent given the knowledge of which class generated
the query, i.e.

$$P(X|y = i) = \prod_k P(X^{(k)}|y = i). \tag{4}$$

Each attribute is simply a unit of information that contributes to the characterization
of the content source. Possible examples include image features, audio
samples, or text annotations. For a given retrieval operation, the user instantiates
a subset of the A attributes. While text can be instantiated by the simple
specification of a few keywords, pictorial attributes are usually instantiated by
example.

Borrowing the terminology from the Bayesian network literature, we define,
for a given query, a set of observed attributes $O = \{X^{(k)} | X^{(k)} = Q^{(k)}\}$ and a set
of hidden attributes $H = \{X^{(k)} | X^{(k)} \text{ is not instantiated by the user}\}$, where Q
is the query provided by the user. The likelihood of this query is then given by

$$P(Q|y = i) = \sum_H P(O, H|y = i), \tag{5}$$

where the summation is over all possible configurations of the hidden attributes.¹
Using (4) and the fact that $\sum_{X^{(k)}} P(X^{(k)}|y = i) = 1$,

$$\begin{aligned}
P(Q|y = i) &= P(O|y = i) \sum_H \prod_{k|X^{(k)} \in H} P(X^{(k)}|y = i) \\
&= P(O|y = i) \prod_{k|X^{(k)} \in H} \sum_{X^{(k)}} P(X^{(k)}|y = i) \\
&= P(O|y = i),
\end{aligned} \tag{6}$$

i.e. the likelihood of the query is simply the likelihood of the instantiated attributes.
In addition to being intuitively correct, this result also has considerable practical
significance. It means that retrieval complexity grows with the number of attributes
specified by the user and not with the number of attributes known to the
system, which can therefore be arbitrarily large.
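The marginalization result can be verified by brute force on a toy problem. The sketch below assumes three binary attributes with made-up class-conditional tables for a single class. It sums the joint likelihood over every configuration of the hidden attributes and compares the total with the product over the instantiated attributes alone. All names and numbers are illustrative.

```python
from itertools import product

# Hypothetical tables P(X^(k) = v | y = i) for one class i, three binary attributes.
cond = [
    {0: 0.7, 1: 0.3},   # attribute 0
    {0: 0.4, 1: 0.6},   # attribute 1
    {0: 0.9, 1: 0.1},   # attribute 2
]


def likelihood_brute_force(observed):
    """Sum P(O, H | y = i) over every configuration of the hidden attributes."""
    hidden = [k for k in range(len(cond)) if k not in observed]
    total = 0.0
    for values in product([0, 1], repeat=len(hidden)):
        full = dict(observed)
        full.update(zip(hidden, values))
        p = 1.0
        for k, v in full.items():
            p *= cond[k][v]
        total += p
    return total


def likelihood_observed_only(observed):
    """Product over the instantiated attributes only."""
    p = 1.0
    for k, v in observed.items():
        p *= cond[k][v]
    return p


query = {1: 1}   # the user instantiates attribute 1 only
print(likelihood_brute_force(query), likelihood_observed_only(query))
```

The brute-force sum visits a number of configurations that is exponential in the number of hidden attributes, while the second function touches only the instantiated ones, which is the practical point of the result.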

In domains, such as image databases, where it is difficult to replicate human 
judgments of similarity it is impossible to assure that the first response to a query 
will always include the intended database entries. It is therefore important to 
design retrieval systems that can take into account user feedback and tune their 
performance to best satisfy user demands. 

4 Bayesian Relevance Feedback 

We start by supposing that, instead of a single query X, we have a sequence
of t queries $X_1^t = \{X_1, \ldots, X_t\}$, where t is a time stamp. By simple
application of Bayes rule, the optimal map becomes

$$\begin{aligned}
g^*(X_1^t) &= \arg\max_i P(y = i|X_1, \ldots, X_t) \\
&= \arg\max_i \{P(X_t|y = i, X_1, \ldots, X_{t-1}) P(y = i|X_1, \ldots, X_{t-1})\} \\
&= \arg\max_i \{P(X_t|y = i) P(y = i|X_1, \ldots, X_{t-1})\}.
\end{aligned} \tag{7}$$

Comparing (7) with (2) it is clear that the term $P(y = i|X_1, \ldots, X_{t-1})$ is simply
a prior belief on the ability of the $i$-th image class to explain the query. However,
unlike the straightforward application of the Bayesian criterion, this is not a static
prior determined by some arbitrarily selected prior density. Instead, it is learned
from the previous interaction between user and retrieval system and summarizes
all the information in this interaction that is relevant for the decisions to be made
in the future.

¹ The formulation is also valid in the case of continuous variables, with summation
replaced by integration.

Equation (7) is, therefore, a simple but intuitive mechanism to integrate
information over time. It states that the system’s beliefs about the user’s interests
at time t - 1 simply become the prior beliefs for iteration t. New data provided
by the user at time t is then used to update these beliefs, which in turn become
the priors for iteration t + 1. That is, prior beliefs are continuously updated from the
observation of the interaction between user and retrieval system.

We call this type of behavior short-term learning or in-session learning. Starting
from a given dataset (for example an image) and a few iterations of user
feedback, the retrieval system tries to learn what classes in the database best
satisfy the desires of the user. From a computational standpoint the procedure
is very efficient since the only quantity that has to be computed at each time
step is the likelihood of the data in the corresponding query. Notice that this is
exactly what appears in (3) and would have to be computed even in the absence
of any learning. In terms of memory requirements, the efficiency is even higher
since the entire interaction history is reduced to a number per image class. It
is an interesting fact that this number alone enables decisions that are optimal
with respect to the entire interaction.
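The recursion can be sketched as a one-line update per class. The example below keeps a single log-belief per image class and folds in the log-likelihood of each new query. The renormalization step is an addition made here so that the stored numbers remain proper probabilities; it does not affect the argmax. The function name `update_beliefs` and all likelihood values are assumptions for illustration.

```python
import math


def update_beliefs(log_prior, log_likelihood):
    """One iteration of the recursion: add the log-likelihood of the newest
    query to the current log-beliefs and renormalize (log-sum-exp)."""
    unnorm = [lp + ll for lp, ll in zip(log_prior, log_likelihood)]
    m = max(unnorm)
    log_norm = m + math.log(sum(math.exp(u - m) for u in unnorm))
    return [u - log_norm for u in unnorm]


# Three image classes, uniform initial prior, two queries with made-up likelihoods.
beliefs = [math.log(1.0 / 3.0)] * 3
for lik in ([0.5, 0.3, 0.2], [0.6, 0.3, 0.1]):
    beliefs = update_beliefs(beliefs, [math.log(p) for p in lik])
best = max(range(3), key=lambda i: beliefs[i])
print(best, [math.exp(b) for b in beliefs])
```

Only the list `beliefs` has to be stored between iterations, which mirrors the observation that the interaction history reduces to one number per class.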

By taking logarithms and solving for the recursion, (7) can also be written
as

$$g^*(X_1^t) = \arg\max_i \left\{ \sum_{k=0}^{t-1} \log P(X_{t-k}|y = i) + \log P(y = i) \right\}. \tag{8}$$

This exposes a limitation of the belief propagation mechanism: for large t the
contribution, to the right-hand side of the equation, of the new data provided by
the user is very small, and the posterior probabilities tend to remain constant.
This can be avoided by penalizing older terms with a decay factor $\alpha_{t-k}$,
where $\alpha_t$ is a monotonically decreasing sequence. In particular, if
$\alpha_{t-k} = \alpha(1 - \alpha)^k$, $\alpha \in (0, 1]$, we have

$$g^*(X_1^t) = \arg\max_i \{\alpha \log P(X_t|y = i) + (1 - \alpha) \log P(y = i|X_1, \ldots, X_{t-1})\}. \tag{9}$$

5 Combining Different Content Modalities

So far we have not discussed in any detail what types of data can be modeled
by the attributes $X^{(k)}$ of equation (4). Because there is no constraint for
these attributes to be of the same type, the Bayesian framework can naturally
integrate many different modalities. In this work we restrict our attention to the
integration of visual attributes with text annotations.

Assuming a query $X_1^t = \{T_1^t, V_1^t\}$, composed of both text ($T_1^t$) and visual
attributes ($V_1^t$), and using (4) and (9),

$$g^*(X_1^t) = \arg\max_i \{\alpha \log P(V_t|y = i) + \alpha \log P(T_t|y = i) + (1 - \alpha) \log P(y = i|X_1^{t-1})\}. \tag{10}$$

Disregarding the decay factor $\alpha$, the comparison of this equation with (2)
reveals an alternative interpretation for Bayesian integration: the optimal class
is the one which would best satisfy the visual query alone but with a prior consisting
of the combination of the second and third terms in the equation. That is,
by instantiating text attributes, the user establishes a context for the evaluation
of visual similarity that changes the system’s prior beliefs about which class is
most likely to satisfy the visual query. Or, in other words, the text attributes
provide a means to constrain the visual search. Similarly, the second term in the
equation can be considered the likelihood function, with the combination of the
first and the third forming the prior. In this interpretation, the visual attributes
constrain what would be predominantly a text-based search. Both interpretations
illustrate the power of the Bayesian framework to take into account any
available contextual information and naturally integrate information from different
sources. We next concentrate on the issue of finding good representations
for text and visual attributes.
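The way text and visual evidence trade off in the decayed criterion can be sketched as follows. The example assumes that the visual and text log-likelihoods of the current query and the log-belief carried over from earlier iterations are already available for each class. The function names and all probabilities are hypothetical.

```python
import math


def combined_score(log_p_visual, log_p_text, log_prior, alpha):
    """Score of one class: decayed mix of the current visual and text
    log-likelihoods with the log-belief from previous iterations."""
    return alpha * (log_p_visual + log_p_text) + (1.0 - alpha) * log_prior


def best_class(log_p_visual, log_p_text, log_prior, alpha=0.5):
    """Index of the class with the highest combined score."""
    scores = [combined_score(v, t, p, alpha)
              for v, t, p in zip(log_p_visual, log_p_text, log_prior)]
    return max(range(len(scores)), key=lambda i: scores[i])


# Two classes: class 0 matches the picture slightly better,
# class 1 matches the keywords much better (made-up numbers).
visual = [math.log(0.6), math.log(0.4)]
text = [math.log(0.1), math.log(0.9)]
prior = [math.log(0.5), math.log(0.5)]

visual_only = best_class(visual, [0.0, 0.0], prior)
with_text = best_class(visual, text, prior)
print(visual_only, with_text)
```

With the text term switched off the visually closer class wins; once the keywords are taken into account they act like a prior that redirects the visual search, as described in the text.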



6 Visual Representations 



We have recently introduced an image representation based on embedded multiresolution
mixture models that has several nice properties for the retrieval
problem. Because the representation has been presented in detail elsewhere [8],
here we provide only a high-level description.

Images are characterized as groups of visual concepts (e.g. a picture of a
snowy mountain under blue sky is a grouping of the concepts “mountain”,
“snow” and “sky”). Each image class in the database defines a probability density
over the universe of visual concepts and each concept defines a probability
density over the space of image measurements (e.g. the space of image colors).
Each image in the database is seen as a sample of independent and identically
distributed feature vectors drawn from the density of one of the image classes

$$P(V_t|y = i) = \prod_j \sum_k P(v_{t,j}|c(v_{t,j}) = k, y = i) P(c(v_{t,j}) = k|y = i), \tag{11}$$

where $v_{t,j}$ are the feature vectors in $V_t$ and $c(v_{t,j}) = k$ indicates that $v_{t,j}$ is a
sample of concept k. The density associated with each concept can be either a
Gaussian or a mixture of Gaussians, leading to a mixture of Gaussians for the
overall density. This model is 1) able to approximate arbitrary densities and 2) computationally
tractable in high dimensions (complexity only quadratic in the dimension of the
feature space), avoiding the limitations of the Gaussian and histogram models
traditionally used for image retrieval.
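A small sketch may help to see how the likelihood of a set of feature vectors is evaluated under such a mixture. It assumes scalar features and one Gaussian per concept, whereas the features used in the paper are multi-dimensional. The function names, mixture parameters and feature values are all made up for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(features, weights, means, variances):
    """Log-likelihood of i.i.d. scalar features under a Gaussian mixture:
    sum over features of the log of the weighted sum over components."""
    total = 0.0
    for v in features:
        mix = sum(w * gaussian_pdf(v, m, s)
                  for w, m, s in zip(weights, means, variances))
        total += math.log(mix)
    return total


# Hypothetical query features and two image classes.
features = [0.1, -0.2, 3.1]
ll_a = log_likelihood(features, [0.5, 0.5], [0.0, 3.0], [1.0, 1.0])  # two concepts
ll_b = log_likelihood(features, [1.0], [10.0], [1.0])                # one distant concept
print(ll_a, ll_b)
```

Because the sum runs over whatever features are supplied, the same function applies unchanged to a whole image or to a user-selected region.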

When combined with a multi-resolution feature space, it defines a family 
of embedded densities across image resolutions that has been shown to provide 
precise control over the trade-off between retrieval accuracy, invariance, and com- 
plexity. We have shown that relying on the coefficients of the 8x8 discrete cosine 
transform (DCT) as features leads to 1) good performance across a large range 
of imagery (including texture, object, and generic image databases) and 2) per- 
ceptually more relevant similarity judgments than those achieved with previous 
approaches (including histograms, correlograms, several texture retrieval appro- 
aches and even weighted combinations of texture and color-representations) |E|. 

Finally, because the features $v_{t,j}$ in (12) can be any subset of a given query
image, the retrieval criterion is valid for both region-based and image-based que-
ries. I.e., the combination of the probabilistic retrieval criterion and a generative
model for feature representation enables region-based queries without requiring 
image segmentation. 



7 Text Representation 

Given a set of text attributes X = . . . , known to the retrieval
system, the instantiation of a particular attribute by the user is modeled as a
Bernoulli random variable. Defining

    $P(X^{(j)} = 1 \mid y = i) = p_{i,j},$    (13)

and assuming that different attributes are independently distributed, this leads
to

    $\log P(T_t \mid y = i) = \sum_j I_{\delta(X^{(j)}) = 1} \log p_{i,j}$    (14)

where $I_{x=k} = 1$ if $x = k$ and zero otherwise, and we have used

7.1 Parameter Estimation 

There are several ways to estimate the parameters pij. The most straightforward 
is to use manual labeling, relying on the fact that many databases already include 
some form of textual annotations. For example, an animal database may be 
labeled for cats, dogs, horses, and so forth. In this case it suffices to associate 
the term “cats” with $X^{(1)}$, the term “dogs” with $X^{(2)}$, etc., and make $p_{i,1} = 1$
for pictures with the cats label and $p_{i,1} = 0$ otherwise, $p_{i,2} = 1$ for pictures
with the dogs label and $p_{i,2} = 0$ otherwise, and so forth. In response to a query
instantiating the “cats” attribute, (14) will return 0 for the images containing
cats and $-\infty$ for those that do not. In terms of lliull (and associated discussion
in section EJ this is a hard constraint: the specification of the textual attributes
eliminates from further consideration all the images that do not comply with
them.
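
The behaviour of (14) under binary annotations can be illustrated with a minimal sketch. The function name, the list-based layout of the $p_{i,j}$ and the example values are assumptions made here for illustration only; they are not part of the described system.

```python
import math

def text_log_likelihood(instantiated, p_class):
    """Sum of log p_{i,j} over the attributes j instantiated by the user, as in (14).

    `p_class[j]` is the (assumed) probability that attribute j is instantiated
    when the user is interested in class i.
    """
    total = 0.0
    for j in instantiated:
        p = p_class[j]
        # log(0) is treated as minus infinity: the class is ruled out.
        total += math.log(p) if p > 0.0 else -math.inf
    return total

# Hypothetical annotations: attribute 0 = "cats", attribute 1 = "dogs".
p_cat_class = [1.0, 0.0]
p_dog_class = [0.0, 1.0]
query = [0]  # the user instantiates "cats"

print(text_log_likelihood(query, p_cat_class))  # 0.0
print(text_log_likelihood(query, p_dog_class))  # -inf
```

With binary values every class that lacks the label receives a log-probability of $-\infty$, whereas any non-zero value (e.g. 0.05) would yield a finite penalty instead.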

Hard constraints are usually not desirable, both because there may be an- 
notation errors and because annotations are inherently subjective. For example 
while the annotator may place leopards outside the cats class, a given user may 
use the term “cats” when searching for leopards. A better solution is to rely on 
soft constraints where the pij are not restricted to be binary. In this case, the 
“cats” label could be assigned to leopard images, even though the probability 
associated with the assignment would be small. In this context pij should be 
thought of as the answer to the question “what is the probability that the user 
will instantiate attribute $X^{(j)}$ given that he/she is interested in images from class
i?”. In practice, it is usually too time consuming to define all the pij manually 
and not clear how to decide on the probability assignments. A better alternative 
is to rely on learning. 



7.2 Long Term Learning 

Unlike the learning algorithms discussed in section 0 here we are talking about 
long-term learning or learning across retrieval sessions. The basic idea is to let 
the user attach a label to each of the regions that are provided as queries during 
the course of the normal interaction with the retrieval system. E.g., if in order 
to find a picture of a snowy mountain the user selects a region of sky, he/she has 
the option of labeling that region with the word “sky” establishing the “sky” 
attribute. 

Given K example regions $\{e_{j,1}, \ldots, e_{j,K}\}$ of a given attribute $X^{(j)}$, whenever,
in a subsequent retrieval session, the user instantiates that attribute, its proba-
bility is simply the probability of the associated examples. I.e., (14) becomes

    $\log P(T_t \mid y = i) = \sum_j I_{\delta(X^{(j)}) = 1} \log P(e_{j,1}, \ldots, e_{j,K} \mid y = i).$    (15)

The assumption here is that when the user instantiates an attribute, he/she 
is looking for images that contain patterns similar to the examples previously 
provided. Since, assuming independence between examples, 

    $\log P(e_{j,1}, \ldots, e_{j,K} \mid y = i) = \sum_k \log P(e_{j,k} \mid y = i)$    (16)

only the running sum of $\log P(e_{j,k} \mid y = i)$ must be saved from session to session;
there is no need to keep the examples themselves. Hence, the complexity is 
proportional to the number of classes in the database times the number of known 
attributes and, therefore, manageable. 
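
The bookkeeping implied by (16) can be sketched as follows. The class name, its methods and the numerical values are hypothetical, and the per-class example likelihoods are assumed to be strictly positive and supplied by the visual model.

```python
import math

class AttributeMemory:
    """Running sums of log P(e_{j,k} | y = i), one per (class, attribute) pair."""

    def __init__(self, num_classes, num_attributes):
        # Storage grows with classes x attributes, not with the number of examples.
        self.sums = [[0.0] * num_attributes for _ in range(num_classes)]

    def add_example(self, attribute, example_likelihoods):
        # example_likelihoods[i] is P(e | y = i) for the newly labelled region e.
        for i, p in enumerate(example_likelihoods):
            self.sums[i][attribute] += math.log(p)

    def log_belief(self, class_index, instantiated):
        # Sum of the stored running sums over the instantiated attributes, as in (15).
        return sum(self.sums[class_index][j] for j in instantiated)

memory = AttributeMemory(num_classes=2, num_attributes=1)
memory.add_example(0, [0.5, 0.1])   # first "sky" example, two classes
memory.add_example(0, [0.25, 0.1])  # second "sky" example
print(memory.log_belief(0, [0]), memory.log_belief(1, [0]))
```

Once an example has been added its region can be discarded, which is what keeps the storage independent of the number of examples.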

Grounding the annotation model directly in visual examples also guarantees 
that the beliefs of (d are of the same scale as those of li l 'Zi , making the applica- 
tion of ll 1 1 )ll straightforward. If different representations were used for annotations 
and visual attributes, one would have to define weighting factors to compensate
for the different scales of the corresponding beliefs. Determining such weights is
usually not a simple task.

There is, however, one problem with the example-based solution of (II3. 
While the complete set of examples of a given concept may be very diverse, indi- 
vidual image class models may not be able to account for all this diversity. In the 
case of “sky” discussed above, while there may be examples of sunsets, sunrises, 
and skies shot on cloudy, rainy or sunny days in the sky example set, particular 
image classes will probably not encompass all this variation. For example, ima- 
ges of “New York at sunset” will only explain well the sunset examples. Thus, 
while this class should receive a high rank with respect to “skyness” , there is no 
guarantee that this will happen, since it assigns low probability to a significant 
number of examples. 

The fact is that most image classes will only overlap partially with broad 
concept classes like sky. The problem can be solved by requiring the image 
classes to explain well only a subset of the examples. One solution is to rank the 
examples according to their probability and apply (I I dl) only to the top ones, 

    $\log P(T_t \mid y = i) = \sum_j I_{\delta(X^{(j)}) = 1} \sum_{r=1}^{R} \log P(e_j^{(r)} \mid y = i),$    (17)

where $e_j^{(r)}$ is the example region of rank r and R a small number (10 in our
implementation).
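
The ranking step in (17) can be illustrated with a short sketch for a single attribute and a single image class. The function name and the example likelihoods are assumptions for illustration; only the default R = 10 comes from the text.

```python
import heapq
import math

def top_r_log_likelihood(example_likelihoods, r=10):
    """Sum of the r largest log P(e | y = i) over the examples of one attribute.

    `example_likelihoods` holds P(e | y = i) for every stored example e of the
    attribute under a single image class i (assumed strictly positive).
    """
    logs = [math.log(p) for p in example_likelihoods]
    return sum(heapq.nlargest(r, logs))

# Hypothetical "sky" examples scored by a "New York at sunset" class: two sunset
# examples are explained well, two daytime examples are explained poorly.
likelihoods = [0.5, 0.4, 1e-9, 1e-12]
print(top_r_log_likelihood(likelihoods, r=2))   # only the two sunset examples count
print(top_r_log_likelihood(likelihoods, r=10))  # all four examples count
```

Restricting the sum to the best-explained examples prevents the poorly explained ones from dominating the score of a class that covers only part of the concept.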

8 Experimental Evaluation 

We performed experiments to evaluate the effectiveness of both short and long 
term learning. Because short term learning involves the selection, at each itera- 
tion, of the image regions to provide as next query it involves the segmentation 
of the query image. While this is not a problem for human users, it is difficult 
to simulate in an automated set up. To avoid this difficulty we relied on a pair 
of databases for which segmentation ground truth is available. 




Fig. 1. Example mosaics from Brodatz (left) and Columbia (right). 



These databases were created from the well-known Brodatz (texture) and
Columbia (object) databases, by randomly selecting 4 images at a time and
making a 2 x 2 mosaic out of them. Each of the mosaic databases contains 2,000
images, two of which are shown in Figure 1. Since the individual images are of
quite distinct classes (texture vs objects), testing on both Brodatz and Columbia
assures us that the results here presented should hold for databases of generic
imagery. All experiments were based on the DCT of an 8 x 8 window sliding
by two-pixel increments. Mixtures of 8 (16) Gaussians were used for Brodatz 
(Columbia). Only the first 16 DCT coefficients were used for retrieval. 

The goal of the short-term learning experiments was to determine if it is 
possible to reach a desired target image by starting from a weakly related one and 
providing feedback to the retrieval system. This is an iterative process where each 
iteration consists of selecting image regions, using them as queries for retrieval 
and examining the top V retrieved images. From these, the one with most sub- 
images in common with the target is selected to be the next query. One 8x8 
image neighborhood from each sub-image in the query was then used as an 
example if the texture or object depicted in that sub-image was also present 
in the target. Performance was averaged over 100 runs with randomly selected 
target images. 




Fig. 2. Plots of the convergence rate (left), and average number of iterations for con- 
vergence (right), for the Brodatz mosaic database. 



Figure 2 presents plots of the convergence rate and mean number of itera-
tions until convergence as a function of the decay factor $\alpha$ and the number of
matches V, for the Brodatz mosaic database (the results for Columbia are si-
milar). In both cases the inclusion of learning ($\alpha < 1$) always increases the rate
of convergence. This increase can be very significant (as high as 15%) when V
is small. Since users are typically not willing to go through various screens of
images in order to pick the next query, these results show that learning leads
to visible convergence improvements. In general, a precise selection of $\alpha$ is not
crucial to achieve good convergence rates. In terms of the number of iterations, 
when convergence occurs it is usually very fast (from 4 to 8 iterations). 

Figure 3 illustrates the challenges faced by learning. It depicts a search for an
image containing a plastic bottle, a container of adhesive tape, a clay cup, and a
white mug. The initial query is the clay cup. Since there are various objects made
of wood in Columbia and these have surface properties similar to those of clay, 
precision is low for this particular query: only 4 of the 15 top matches are correct 
(top left picture). This makes the subset of the database that is considered to 
satisfy the query relatively large and the likelihood that other objects in the 
target will appear among the top matches is low. Consequently the feedback 
process must be carried on for three iterations before a target object, other than
that in the query, appears among the top matches. When this happens, the new 
object does not appear in the same image as the query object. 




Fig. 3. Four iterations of relevance feedback (shown in raster-scan order). For each 
iteration, the target image is shown at the top left and the query image immediately 
below. Shown above each retrieved image is the number of target objects it contains. 



In this situation, the most sensible option is to base the new query on the 
newly found target object (tape container). However, in the absence of learning, 
it is unlikely that the resulting matches will contain any instances of the query 
object used on the previous iterations (clay cup) or the objects that are confoun- 
ded with it. As illustrated by the bottom right picture, the role of learning is to 
favor images containing these objects. In particular, 7 of the 15 images returned 
in response to a query based on the tape container include the clay cup or similar 
objects (in addition to the tape container itself). This enables new queries based
on both target objects that considerably narrow down the number of candidates
and, therefore, have a significantly higher chance of success. In this particular 
example it turns out that one of the returned images is the target itself but, 
even when this is not so, convergence takes only a few iterations. 



8.1 Long Term Learning 

The performance of a long-term learning algorithm will not be the same for all 
concepts to be learned. In fact, the learnability of a concept is a function of 
two main properties: visual diversity, and distinctiveness on the basis of local 
visual appearance. Diversity is responsible for misses, i.e. instances of the concept 
that cannot be detected because the learner has never seen anything like them. 
Distinctiveness is responsible for false positives, i.e. instances of other concepts 
that are confused with the desired one. Since the two properties are functions 
of the image representation, it is important to evaluate the learning of concepts 
from various points in the diversity/distinctiveness space. 

We relied on a subset of the Corel database (1,700 images from 17 classes) 
to evaluate long-term learning and identified 5 such concepts: a logo, tigers, sky, 
snow and vegetation. Since common variations on a given logo tend to be restric- 
ted to geometric transformations, logos are at the bottom of the diversity scale. 
Tigers (like most animals) are next: while no two tigers are exactly alike, they 
exhibit significant uniformity in visual appearance. However, they are usually 
subject to much stronger imaging transformations than logos (e.g. partial occlu- 
sion, lighting, perspective). Snow and sky are representative of the next level in 
visual diversity. Even though relatively simple concepts, their appearance varies 
a lot with factors like imaging conditions (e.g. sunny vs cloudy day) or the time
of the day (e.g. sky at noon vs sky at sunset). Finally, vegetation encompasses a 
large amount of diversity. In terms of distinctiveness, logos rank at the top (at 
least for Corel where most images contain scenes from the real world), followed 
by tigers (few things look like a tiger), vegetation, sky and snow. Snow is clearly 
the least distinctive concept since large smooth white surfaces are common in
many scenes (e.g. clouds, white walls, objects like tables or paper). 

To train the retrieval system, we annotated all the images in the database 
according to the presence or not of each of the 5 concepts. We then randomly 
selected a number of example images for each concept and manually segmen- 
ted the regions where concepts appeared. These regions were used as examples 
for the learner. Concept probabilities were estimated for each image outside the 
training set using (17) and, for each concept, the images were ranked according
to these probabilities. Figure 4 a) presents the resulting precision/recall (PR)
curves for the 5 concepts. Retrieval accuracy seems to be directly related to
concept distinctiveness: a single training example is sufficient for perfect reco-
gnition of the logo and with 20 examples the system does very well on tigers,
reasonably well on vegetation and sky, and poorly on snow. These are very good 
results, considering the reduced number of training examples and the fact that 
the degradation in performance is natural for difficult concepts.

Performance can usually be improved by including more examples in the
training set, as this reduces the concept diversity problem. This is illustrated in
Figure 4 b) and c) where we show the evolution of PR as a function of the number
of training examples for sky and tigers. In both cases, there is a clear improve- 
ment over the one-example scenario. This is particularly significant, since this 
scenario is equivalent to the standard query-by-example (where users retrieve 
images of a concept by providing the system with one concept example) . As the 
figures clearly demonstrate, one example is usually not enough, and long-term 
learning does improve performance by a substantial amount. In the particular 
case of sky it is clear that performance can be made substantially better than in 
Figure 4 a) by considering more examples. On the other hand, Figure 4 d) shows
that more examples make a difference only when performance is limited by a 
poor representation of concept diversity, not distinctiveness. For snow, where 
the latter is the real bottleneck, more examples do not make a difference. 







Fig. 4. Long-term learning, a) PR curves for the 5 concepts, b), c), and d) PR as a 
function of the training set size for sky [b)], tigers [c)], and snow [d)]. 



Figure 5 shows the top 25 matches for the tiger and sky concepts. It illustrates
well how the new long term learning mechanism is robust with respect to concept
diversity, either in terms of different camera viewpoints, shading, occlusions, etc.
(tiger) or variations in visual appearance of the concept itself (sky).




Fig. 5. Top 25 matches for the tiger (left) and sky (right) concepts. 



Abstract. A novel approach to colour-based object recognition and
image retrieval -the multimodal neighbourhood signature- is proposed. 
Object appearance is represented by colour-based features computed 
from image neighbourhoods with multi-modal colour density function. 
Stable invariants are derived from modes of the density function that 
are robustly located by the mean shift algorithm. The problem of ex- 
tracting local invariant colour features is addressed directly, without a 
need for prior segmentation or edge detection. The signature is concise - 
an image is typically represented by a few hundred bytes, a few thousands 
for very complex scenes. 

The algorithm’s performance is first tested on a region-based image re- 
trieval task achieving a good (92%) hit rate at a speed of 600 image 
comparisons per second. The method is shown to operate successfully 
under changing illumination, viewpoint and object pose, as well as non- 
rigid object deformation, partial occlusion and the presence of backgro- 
und clutter dominating the scene. The performance of the multimodal 
neighbourhood signature method is also evaluated on a standard colour 
object recognition task using a publicly available dataset. Very good re- 
cognition performance (average match percentile 99.5%) was achieved 
in real time (average 0.28 seconds for recognising a single image) which 
compares favourably with results reported in the literature. 



1 Introduction 

Colour-based image and video retrieval has many applications and acceptable re- 
sults have been demonstrated by many research and commercial systems during 
the last decade m- Very often, applications require retrieval of images where the 
query object or region cover only a fractional part of the database image, a task 
essentially identical to appearance-based object recognition with unconstrained 
background. Retrieval and recognition based on object colours must take into 
account the factors that influence formation of colour images: viewing geometry, 
illumination conditions, sensor spectral sensitivities and surface reflectances. In 
many applications, illumination colour, intensity as well as view point and back- 
ground may change. Moreover, partial occlusion and deformation of non-rigid 
objects must also be taken into consideration. Consequently, invariance or at 
least robustness to these diverse factors is highly desirable.

Most current colour-based retrieval systems utilise various versions of the
colour histogram 123 which has proven useful for describing the colour content 
of the whole image. However, histogram matching cannot be directly applied to 
the problem of recognising objects that cover only a fraction of the scene. Moreo- 
ver, histograms are not invariant to varying illumination and not generally robust 
to background changes. Applying colour constancy methods to achieve illumi- 
nation invariance for histogram methods is possible but colour constancy itself 
poses a number of challenging problems 0. Other methods addressing image (as 
opposed to object) similarity are those using wavelets [E] and moments of the 
colour distribution ITTO . Finally, graph representations of colour content (like 
the colour adjacency graph US] and its extension to a hybrid graph m have 
provided good recognition for scenes with fairly simple colour structure. 

Departing from global methods, localised invariant features have been pro- 
posed in order to gain robustness to background changes, partial occlusion and 
varying illumination conditions. Histograms of colour ratios computed locally 
from pairs of neighbouring pixels for every image pixel 0 or across detected 
edges cni have been used. However, both methods are limited due to the global 
nature of histogram representation. In the same spirit, invariant ratio features 
have been extracted from nearby pixels across boundaries of segmented regions 
for object recognition pniTT?] . Absolute colour features have been extracted from 
segmented regions in pniTi . However, reliable image segmentation is arguably a 
notoriously difficult task P2C3. Other methods split the image into regions from 
where local colour features are computed. For example, the FOCUS system jS] 
constructs a graph of the modes of the colour distribution from every image 
block. However, not only is extracting features from every image neighbourhood
inefficient, but the features used also do not account for illumination change. In
addition, use of graph matching for image retrieval has often been criticised due 
to its relatively high complexity. 

We propose a method to address the colour indexing task by computing 
colour features from local image neighbourhoods with a multimodal colour pro-
bability density function. First, we detect multimodal neighbourhoods in the
image using a robust mode estimator, the mean shift algorithm. From the
mode colours we are then able to compute a number of local invariant features 
depending on the adopted model of colour change. Under different assumptions, 
the resulting multimodal neighbourhood signatures (MNS) consist of colour ra- 
tios, chromaticities, raw colour values or combinations of the above. Our method 
improves on previous ones by 

- creating a signature which concisely represents the colour content of the 
image by stable measurements computed from neighbourhoods with infor- 
mative colour structure. Neither prior segmentation nor edge detection is 
needed. 

- computing invariant features from robustly filtered colour values representing 
local colour content 

- effectively using the constraints about the illumination change model thus 
resulting in a flexible colour signature 

- applying signature instead of histogram matching to identify and localise the 
query object in the database images

The advantages of computing features from detected multimodal neighbour-
hoods are discussed in the next section. The algorithmic details of the approach
and the implemented algorithm are described in section 3. Section 0 presents de-
tails about the experimental setup and the results obtained are presented in
section 0. Section 0 concludes the paper.

2 The MNS Approach 

Consider an image region consisting of a small compact set of pixels. The shape 
of the region is not critical for our application. For convenience, we use regions 
defined as neighbourhoods around a central point. Depending on the number of 
modes of the probability distribution of the colour values, we characterise such 
regions as unimodal or, for more than one mode, multimodal neighbourhoods. 
Clearly, for unimodal neighbourhoods no illumination invariant features can be 
computed. We therefore focus on detected multimodal neighbourhoods. In par- 
ticular, multimodal neighbourhoods with more than two modes provide good 
characterisation of objects like the ball in Fig. 1(d) and can result in efficient
recognition on the basis of only a few features.







Fig. 1. Multimodal neighbourhood detection: (a) original image (b) randomised grid 
(c) detected bimodal neighbourhoods (d) detected trimodal neighbourhoods 



The advantages of extracting colour information from multimodal neighbour- 
hoods are many-fold. Local processing is robust to partial occlusion and defor- 
mation of non-rigid objects. Data reduction is achieved by extracting features 
only from a subset of all image neighbourhoods. Moreover, a rich description 
of colour content is obtained since a single colour patch can contribute to more 
than one neighbourhood feature computation. The computation time needed 
to create the colour signature is small since the most common neighbourhood 
type - unimodal neighbourhoods - are ignored after being detected very effi- 
ciently. Furthermore, illumination invariant features can be computed from the 
mode values to account for varying illumination conditions even within the same 
image. Regarding retrieval, region-based queries are efficiently handled and lo- 
calisation of the query instance in the database images is possible. Finally, the 
proposed representation allows the users to select exactly the local features they
are interested in from the set of the detected multimodal neighbourhoods of the
query image.

Computing colour invariants from detected multimodal neighbourhoods has 
certain advantages with respect to extracting features across detected edges. 
Most edge detection methods require an intensity gradient and a locally linear 
boundary. They often perform poorly at corners, junctions and regions with 
colour texture - exactly in those regions, where colour information can be highly 
discriminative. In addition, the multimodal neighbourhood approach directly 
formulates the problem of extracting colour features. 



3 The Algorithm 

3.1 Computing the MNS Signature 

The image plane is covered by a set of overlapping small compact regions. In the 
current implementation, rectangular neighbourhoods with dimensions $(b_x, b_y)$
were chosen. Compact regions of arbitrary shape - or even non-contiguous com-
pact sets of pixels - could have been used. Rectangular neighbourhoods were
selected since they facilitate simple and fast processing of the data. To avoid
aliasing each rectangle is perturbed with a displacement with uniform distri-
bution in the range $[0, b_x/2)$, $[0, b_y/2)$, Fig. 1(b). To improve coverage of an
image (or image region), more than one randomised grid can be used, slightly
perturbed from each other. 

For every neighbourhood defined by such randomised grids, the modes of the 
colour distribution are computed with the mean shift algorithm described below. 
Modes with relatively small support are discarded as they usually represent 
noisy information. The neighbourhoods are then categorised according to their 
modality as unimodal, bimodal, trimodal etc. (e.g. see Fig. 1).

For the computation of the colour signature only multimodal neighbourhoods 
are considered. For every pair of mode colours $m_i$ and $m_j$ in each neighbourhood,
we construct a vector $v = (m_i, m_j)$ in a joint 6-dimensional domain denoted
$RGB^2$. In order to create an efficient image descriptor, we cluster the computed
colour pairs in the $RGB^2$ space and a representative vector for each cluster is
stored. The colour signature we propose consists of the modes of the distribution
in the $RGB^2$ space. For the clustering, the mean shift algorithm is applied once
more to establish the local maxima. The computed signature consists of a number
of $RGB^2$ vectors depending on the colour complexity of the scene. The resulting
structure is, generally, very concise and flexible.
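
The construction of the 6-dimensional pair vectors for one neighbourhood can be sketched as below. The function name and the mode values are hypothetical, and the ordering of the two colours within a pair is an assumption, since the text does not fix it.

```python
from itertools import combinations

def mode_pair_vectors(modes):
    """Return one 6-D vector (m_i, m_j) for every pair of RGB modes.

    `modes` is the list of mode colours found in a single neighbourhood.
    A unimodal neighbourhood yields no vectors and is therefore ignored.
    """
    return [tuple(mi) + tuple(mj) for mi, mj in combinations(modes, 2)]

# Hypothetical neighbourhoods: one unimodal, one bimodal, one trimodal.
print(mode_pair_vectors([(200, 30, 30)]))
print(mode_pair_vectors([(200, 30, 30), (20, 20, 180)]))
print(len(mode_pair_vectors([(200, 30, 30), (20, 20, 180), (240, 240, 240)])))
```

The vectors collected over all multimodal neighbourhoods are the data that the second application of the mean shift algorithm would cluster.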

Note that for the computation of the signature no assumption about the 
colour change model was needed. The parameters controlling mode seeking, that 
is the kernel width and the neighbourhood size are dependent on the database 
images; the former being related to the amount of filtering (smoothing) associa- 
ted with the mean shift and the latter depending on the scale of the scene. A 
multiscale extension of the algorithm, though relatively straightforward to im- 
plement (e.g. by applying the MNS computation to an image pyramid), has not 
yet been tested.

3.2 Computation of Neighbourhood Modality with the Mean Shift
Algorithm

To establish the location of a mode of the colour density function the mean shift
algorithm is applied in the RGB domain. The general kernel-based estimate of
a true multivariate density function $f(x)$ at a point $x_0$ in a d-dimensional data
space is given by

    $\hat{f}(x_0) = \frac{1}{n h^d} \sum_{i=1}^{n} K\left(\frac{x_i - x_0}{h}\right)$    (1)

where $x_i$, i = 1..n, are the sample data points and K is the kernel function with
kernel width h. In this work, we are not interested in the value of the density
function at the point $x_0$ but rather in the locations of its maxima in the
data space. A simple and efficient algorithm for locating the maximum density
points was proposed by Fukunaga 0 when the kernel function in (1) is the
Epanechnikov kernel



    $K_E(x) = \begin{cases} \frac{1}{2} c_d^{-1} (d + 2)(1 - x^T x) & \text{if } x^T x < 1 \\ 0 & \text{otherwise} \end{cases}$    (2)

where $c_d$ is the volume of the unit d-dimensional sphere and x are the data
points. The kernel has been shown to be robust to outliers and optimum in 
the sense of having minimum integrated square error in comparison with other 
kernels 0. 

The mechanism of the mean shift algorithm consists of iteratively shifting
the kernel to the average of the data points within it, i.e. by the mean difference vector

    $M_h(x) = \frac{1}{n_x} \sum_{x_i \in S_h(x)} (x_i - x) = \frac{h^2}{d + 2} \frac{\nabla f(x)}{f(x)}$    (3)

where $n_x$ is the number of data points inside the hypersphere $S_h(x)$ of radius h
centred at x. Equation (3) is an estimate of the normalised gradient of the density
function $f(x)$ in the d-dimensional space. As shown in [7], translation of the
kernel centre towards the direction of the mean difference vector is equivalent 
to a gradient ascent to the local mode of the distribution. Convergence to the 
closest mode is guaranteed 0. 

Due to the non-linearity of the kernel, the filtering preserves discontinuities, 
details and retains local image structure. This is particularly important for ima-
ges containing small objects like the swimmer’s cap in Fig. 2. The speed of the
algorithm was tested experimentally, and convergence was very fast (typically 
4-5 iterations for complex data). Due to its advantageous properties the mean 
shift algorithm has been used in the past for image segmentation 0 and face 
tracking. For the MNS method, a computationally simple algorithm was imple- 
mented (see [II ti] for an efficient implementation). 

Replacing each pixel in the neighbourhood with the mode it converged to
results in a filtered image like the one in Fig. 2(b). The filtered image is produced
by replacing the value of each pixel $p_j$, j = 1..n, of a neighbourhood with the
closest mode $m_j$ of the 3-dimensional colour density function using an iterative
procedure:

Fig. 2. Robust filtering using the mean shift algorithm: (a) original image (b) filtered
image (every neighbourhood pixel replaced by the mode of the density function it
converged to)



For each j = 1..n

1. Initialise i = 0 and set the current mode estimate $m_j^{(0)}$ to the value
of the pixel $p_j$

2. Update the mode estimate $m_j^{(i+1)} = m_j^{(i)} + M_h(m_j^{(i)})$, $i \leftarrow i + 1$,
until convergence, i.e. until $\| m_j^{(i)} - m_j^{(i-1)} \| < \epsilon$

3. Replace the value of pixel $p_j$ with the value of the local mode $m_j$
it converged to.
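
The filtering procedure can be sketched in a few lines, assuming the Epanechnikov kernel, for which adding the mean difference vector amounts to moving to the mean of the samples within radius h. Function names, the tolerance, the iteration cap and the pixel values are illustrative assumptions, not the implementation used in the experiments.

```python
import math

def mean_shift_mode(start, points, h, eps=1e-3, max_iter=100):
    """Follow the mean difference vector from `start` to a local density mode."""
    mode = list(start)
    for _ in range(max_iter):
        # Samples inside the hypersphere of radius h centred at the current estimate.
        inside = [p for p in points if math.dist(p, mode) <= h]
        if not inside:
            break
        # mode + M_h(mode) is the mean of the samples inside the hypersphere.
        mean = [sum(channel) / len(inside) for channel in zip(*inside)]
        shift = math.dist(mean, mode)
        mode = mean
        if shift < eps:
            break
    return mode

def filter_neighbourhood(pixels, h):
    """Replace every pixel colour by the mode it converges to."""
    return [mean_shift_mode(p, pixels, h) for p in pixels]

# Hypothetical neighbourhood with a dark and a bright colour population.
pixels = [(10, 10, 10), (12, 10, 10), (11, 11, 10), (200, 200, 200), (202, 200, 200)]
filtered = filter_neighbourhood(pixels, h=20)
modes = {tuple(round(c, 3) for c in m) for m in filtered}
print(len(modes))  # number of distinct modes found in the neighbourhood
```

Counting the distinct modes after filtering gives the modality of the neighbourhood; a practical version would also discard modes with small support, as described in Section 3.1.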



3.3 Computing Invariant Features from Multimodal 
Neighbourhoods 

From the multimodal neighbourhood signature, a number of invariant features 
can be computed. For the ease of exposition we will describe feature extraction 
from bimodal neighbourhoods which are the simplest multimodal ones. 

Consider a local image patch with two adjacent surfaces i and j. According to
the monochromatic model of surface reflectance [imuoj the two estimated mode
colours will be given by

    $r_i = (R_i, G_i, B_i) = s_i^k g_i c_i^k$
    $r_j = (R_j, G_j, B_j) = s_j^k g_j c_j^k, \quad k = R, G, B$

where $s_i^k$ is the illumination factor, $g_i$ is the geometric factor and $c_i^k$ is the k-th
sensor response to the surface reflectance of patch i under white light (surface
colour). Besides modelling the effects of change in viewpoint and object pose,
the geometric factor g of the monochromatic model encompasses all factors that
have the same effect on each colour channel, e.g. change of aperture or camera
gain and change in illumination intensity. Coefficients $s^k$ represent factors that
affect individual colour channels, e.g. the change of illumination colour in the
diagonal colour constancy model described below.

A different image of the same surface colour pair under different light and 
object pose would change recorded colours to 

    $r'_i = (R'_i, G'_i, B'_i) = s'^k_i g'_i c_i^k$
    $r'_j = (R'_j, G'_j, B'_j) = s'^k_j g'_j c_j^k, \quad k = R, G, B$



respectively. 

Assuming constant illumination for both the database and the query scenes 
the colour change model for $s'^k_c = s^k_c$, $g'_c = g_c$, $c = i, j$ and $g_i = g_j$ becomes 
$r'_i = r_i$ and $r'_j = r_j$ and the simplest invariant colour features appear to be the 
mode colour values in the $RGB^2$ space 

$f_{sg} = (R_i, G_i, B_i, R_j, G_j, B_j)$ 

When $s'^k_c = s^k_c$, $c = i, j$, $\frac{g'_i}{g_i} = \frac{g'_j}{g_j}$ but $g_i \neq g_j$, orientation change is assumed 
to be the same for both surfaces under constant light. Colour change is modelled by 

$r'_i = \frac{g'_i}{g_i} r_i$ and $r'_j = \frac{g'_j}{g_j} r_j$ 

In this case, various 5-dimensional features can be constructed from the mode 
chromaticities 

$x_k = \frac{R_k}{R_k + G_k + B_k}$, $y_k = \frac{G_k}{R_k + G_k + B_k}$, $k = i, j$ 

and rational features. For example the 2 mode chromaticities and an intensity 
ratio produce the 5D feature vector 

$f_g = (x_i, y_i, I_{ij}, x_j, y_j)$ where $I_{ij} = \frac{R_i + G_i + B_i}{R_j + G_j + B_j}$ 

proposed in m- 

When $s'^k_c = s^k_c$, $c = i, j$ but $g_i \neq g_j$ and $\frac{g'_i}{g_i} \neq \frac{g'_j}{g_j}$, varying illumination intensity 
is assumed due to surface discontinuities and orientation changes. Colour 
change is modelled as before and the chromaticities of the estimated mode 
colours are simple invariant 4-dimensional features 

$f_{gd} = (x_i, y_i, x_j, y_j)$ 

The assumption of constant illumination within the same scene is violated 
in most natural scenes. However, it is realistic to assume constant illumination 
colour in local image neighbourhoods. The diagonal model of illumination 
change has been shown plausible when camera sensors are sufficiently 
narrow-band filters [B|. According to this model, illumination change is modelled 
by an independent scaling of the colour channels by a different constant i.e. 
$s'^k_c = d_k s^k_c$, $c = i, j$, $d_k \in \mathbb{R}$. It is easy to show that (assuming diagonal illumination 
change) the ratio of colours between two neighbouring surfaces with 
different colours is invariant to lighting changes PEI. Nevertheless, for the assumption 
to hold, the two neighbouring surfaces must have the same orientation 
i.e. $g_i = g_j$. Invariant features can be computed from the 3 colour channel ratios 
of the mode RGB values 

$\left( \frac{R_i}{R_j}, \frac{G_i}{G_j}, \frac{B_i}{B_j} \right)$ 

In the most general situation, where orientation is different for the two surfaces 
$g_i \neq g_j$ and $\frac{g'_i}{g_i} \neq \frac{g'_j}{g_j}$, the 2-dimensional cross-ratio vectors 

$f_{sgd} = \left( \frac{R_i G_j}{G_i R_j}, \frac{G_i B_j}{B_i G_j} \right)$ 

are invariant under the diagonal model as shown in PH. 
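
The feature families of this section can be illustrated numerically. The sketch below is only an illustration: it assumes strictly positive mode colours, and the function names and dictionary keys are hypothetical labels, not notation from the paper. It computes each feature from a pair of mode colours and checks which ones survive a simulated colour change.

```python
def chromaticity(rgb):
    """Chromaticity (x, y) of an RGB triple with a non-zero channel sum."""
    r, g, b = rgb
    total = r + g + b
    return (r / total, g / total)

def invariant_features(ri, rj):
    """Features of a bimodal neighbourhood with mode colours ri and rj.

    Assumption: all channels are strictly positive, so no ratio divides by 0.
    """
    xi, yi = chromaticity(ri)
    xj, yj = chromaticity(rj)
    return {
        # 6D joint colour vector (constant illumination and geometry)
        "rgb2": tuple(ri) + tuple(rj),
        # 5D: two chromaticities plus an intensity ratio
        "chroma_intensity": (xi, yi, sum(ri) / sum(rj), xj, yj),
        # 4D: two chromaticities
        "chroma": (xi, yi, xj, yj),
        # 3D: per-channel colour ratios
        "ratio": tuple(a / b for a, b in zip(ri, rj)),
        # 2D: cross-ratios
        "cross_ratio": ((ri[0] * rj[1]) / (ri[1] * rj[0]),
                        (ri[1] * rj[2]) / (ri[2] * rj[1])),
    }

def change(rgb, geometry, diag=(1.0, 1.0, 1.0)):
    """Simulate a colour change: geometric factor times diagonal scaling."""
    return tuple(geometry * d * c for d, c in zip(diag, rgb))

ri, rj = (200.0, 40.0, 30.0), (220.0, 210.0, 200.0)
base = invariant_features(ri, rj)
# Different geometric factors on the two surfaces and a change of
# illumination colour: only the cross-ratios are expected to survive.
general = invariant_features(change(ri, 0.8, (2.0, 0.5, 1.5)),
                             change(rj, 0.3, (2.0, 0.5, 1.5)))
```

Comparing `base` and `general` entry by entry shows the trade-off discussed in the text: the lower-dimensional features tolerate more imaging changes, while the higher-dimensional ones carry more colour information.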

Different invariants, but not necessarily independent, may be computed from 
a pair of RGB values (e.g. based on the hue-saturation colour model). We have 
not explored this issue. Invariants that could be obtained by exploiting higher 
order information from neighbourhoods with more than 2 modes have not been 
studied either. 



3.4 Matching Multimodal Neighbourhood Signatures 

A simple signature matching technique was applied to compute the dissimilarity 
between two MNS image signatures. The algorithm attempts to find a match 
for all model features assuming that the model signature contains only infor- 
mation about the object of interest. This assumption is realistic, since in object 
recognition applications a model database is typically built off-line in controlled 
conditions (e.g. with background allowing easy segmentation). In image retrie- 
val applications, the query region is delineated by the user. Sometimes the full 
image is the object of interest and its MNS description is an appropriate model. 
However, if only part of the image is covered by the object of interest and the 
full image descriptor is stored as a model, a loss in recognition performance is 
likely. 

On the other hand, test images may originate from scenes containing the 
model (query) object only as a fraction of the picture. The matching procedure 
is therefore asymmetric. A mismatch of a model feature is penalised whereas a 
mismatch of a test image feature is not. In other words the matching algorithm 
attempts to interpret the model signature as a distorted subset of the test image 
signature. 

Let $I = 1..n$ and $J = 1..m$ be the indices of the model and test features 
respectively. We define a match association function $u(i) : I \rightarrow 0 \cup J$, $i \in I$, 
mapping each model feature i to the test feature it matched or to 0 if it did not 
match. Similarly, a test association function $v(j) : J \rightarrow 0 \cup I$, $j \in J$, maps a 
test feature to a model feature or to 0 in case of no match. A single threshold 
$Th$ defines the maximum allowed distance between two matching features. The 
matching problem, i.e. the problem of uniquely associating each feature $s_i^M$, $i = 1..n$ 
of the model signature with a test feature $s_j^T$, $j = 1..m$ and the computation 
of a match score is resolved in the following 4 steps: 

1. Set $u(i) = 0$ and $v(j) = 0$ $\forall i, j$. From each signature $s$ compute 
the invariant features $f_i^M$, $f_j^T$ according to the colour change model 
dictated by the application. 

2. Compute all pairwise distances $d_{ij} = d(f_i^M, f_j^T)$ between the model 
and test features. 

3. Set $u(i) = j$, $v(j) = i$ if $d_{ij} < d_{kl}$ and $d_{ij} < Th$ $\forall k, l$ with 
$u(k) = 0$ and $v(l) = 0$. 

4. Compute signature dissimilarity as 

$D(s^M, s^T) = \sum_{\forall i : u(i) \neq 0} d_{i u(i)} + \sum_{\forall i : u(i) = 0} Th$    (4) 

Computing overall image similarity, the quality of the model features that 
matched is taken into account and the score is penalised for any unmatched 
model features. Note that features are allowed to match only once. In general, 
the more model features matched, the lower the $D(s^M, s^T)$ value and the more 
similar the compared images. 
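
A minimal sketch of the four matching steps is given below. It rests on two assumptions that should be checked against the original paper: the best remaining pair is accepted greedily in order of increasing distance, and every unmatched model feature adds the threshold `Th` to the score. The name `match_signatures` is hypothetical, and `None` plays the role of the value 0 used for "no match" in the text.

```python
def match_signatures(model, test, dist, threshold):
    """Asymmetric greedy matching of model features to test features.

    Returns (dissimilarity, u, v), where u[i] is the index of the test
    feature matched to model feature i (or None) and v[j] is the inverse.
    """
    u = [None] * len(model)
    v = [None] * len(test)
    # Step 2: all pairwise distances, visited from smallest to largest.
    pairs = sorted(
        (dist(model[i], test[j]), i, j)
        for i in range(len(model))
        for j in range(len(test))
    )
    total = 0.0
    for d, i, j in pairs:
        if d >= threshold:
            break  # no remaining pair is close enough
        # Step 3: each feature may match only once.
        if u[i] is None and v[j] is None:
            u[i], v[j] = j, i
            total += d
    # Step 4: penalise unmatched model features only (asymmetric score).
    total += threshold * sum(1 for x in u if x is None)
    return total, u, v

# Toy 1-D features with an absolute-difference distance.
score, u, v = match_signatures(
    [0.0, 10.0, 50.0], [1.0, 9.0, 11.5, 100.0],
    dist=lambda a, b: abs(a - b), threshold=5.0,
)
```

In the toy call the third model feature has no test feature within the threshold, so it contributes the penalty, while the unmatched test features cost nothing.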



3.5 Computing Feature Distances 

Let $v = (v_a, v_b)$ and $u = (u_a, u_b)$ be two vectors in the $RGB^2$ space. The 
adopted distance function is the sum of the square norms of the pairwise vector 
component differences 

$d_{RGB^2}(v, u) = \min\{\|v_a - u_a\| + \|v_b - u_b\|, \|v_a - u_b\| + \|v_b - u_a\|\}$    (5) 



Taking the minimum distance between the original and component- wise inverted 
vectors is necessary because the order of the mode values in the joint vectors is 
not fixed. 
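
The order-independent distance can be sketched as below. The text speaks of square norms while the printed formula shows plain norms; this sketch assumes plain Euclidean norms, and the name `d_rgb2` is hypothetical.

```python
import math

def d_rgb2(v, u):
    """Distance between two joint vectors, each a pair of RGB mode colours.

    Assumption: plain Euclidean norms. The minimum over both pairings makes
    the result independent of the order in which the two modes are stored.
    """
    (va, vb), (ua, ub) = v, u
    direct = math.dist(va, ua) + math.dist(vb, ub)
    swapped = math.dist(va, ub) + math.dist(vb, ua)
    return min(direct, swapped)

red_white = ((255, 0, 0), (255, 255, 255))
white_red = ((250, 250, 250), (250, 5, 0))   # same colours, modes swapped
distance = d_rgb2(red_white, white_red)
```

Without the swapped term the two red-white pairs above would appear far apart, although they describe nearly the same colour boundary.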

Various distance functions can be defined for the chromaticity and rational 
features. We chose the same function (5) for measuring distance in the 4D 
joint chromaticity domain. A distance for 5D features was proposed in For 
matching relative features, a simple formula was devised 



$d_{frac}(p, q) = \frac{|a \cdot d - b \cdot c|}{\sqrt{a + b + c + d}}$    (6) 

where $p = \frac{a}{b}$ and $q = \frac{c}{d}$ are 2 1-dimensional fractions. The distance between the 
colour ratio between two RGB values $p_i$, $p_j$ defined as 

$r_1 = (r_R^1, r_G^1, r_B^1)$ 

and another ratio $r_2 = (r_R^2, r_G^2, r_B^2)$ between two other colours $q_i$ and $q_j$ is then 

$d_{rat}(r_1, r_2) = \frac{1}{3} (d_{frac}(r_R^1, r_R^2) + d_{frac}(r_G^1, r_G^2) + d_{frac}(r_B^1, r_B^2))$    (7) 

The modification of $d_{rat}$ to measure the distance between 2-dimensional cross-ratios 
is trivial, ignoring one colour channel in (7). 




Fig. 3. Query selection and representative multimodal neighbourhoods 




(a) (b) (c) (d) 



Fig. 4. Sample target images demonstrating possible cases of: (a) background clutter, 
(b) non-rigid deformation, (c) illumination change and (d) object size 



4 Data and Experimental Setup 

4.1 Image Retrieval Experiment 

We tested the suitability of the multimodal neighbourhood signature method 
for region-based image retrieval using a 30 minute video sequence of a BBC 
summary of the Atlanta Olympic games. The objective of the experiment was 
to retrieve frames that involved Irish events or athletes, therefore we searched 
for the presence of the Irish national colours in the image database. 

In total, 145 frames were randomly chosen from the sequence yielding a 
database of very different images, taken both indoors and outdoors. Object pose, 
scale as well as illumination was arbitrary (Fig. 5). No image was removed from 
the original selection and no image preprocessing was applied. The size of each 
frame was 176 x 144 pixels. The query image was a rectangular region, a part 
of an Irish flag (Fig. 3). The MNS signatures were constructed quickly using a 
more efficient implementation described in US). The multimodal signatures for 
the database images were computed in 0.1 seconds on average and the query 
signature was computed in less than 0.1 seconds on a SUN Ultra Enterprise 450 
with quad 400MHz UltraSPARC-II CPUs. The average signature size for the 
database images was 900 bytes. 

In order to evaluate performance, 13 “target” images containing the Irish 
colours were included in the database. The target images were manually selected 
from the same sequence as the database images and represented scenes 
of very different content. Objects containing the sought colours in the target 
images were often Irish flags, sometimes occluded, non-rigidly deformed and/or 
of various sizes (Fig. 4). Sometimes, the frames were taken at shot transitions 
where video editing effects were apparent. Finally, illumination conditions changed 
dramatically between some of the frames resulting in completely different 
recorded colours. For example compare Fig. 3 with Fig. 4(c) taken in the 
evening under very different light. Figs. 3 and 4 can be viewed in colour 
in [TT!j . 




Fig. 5. Sample database images 



The parameters involved in the computation of the signature and matching 
were not especially tuned for the task. Two consecutive randomised grid searches 
were performed with the same neighbourhood size (8x8 pixels) and the resulting 
multimodal neighbourhoods were merged into a larger set before clustering and 
computing the signature. Informative higher order features that are available 
at multimodal neighbourhoods with more than 2 modes were not exploited in 
the reported experiments. For the mean shift algorithm, a fixed kernel width of 
25 units was used for the detection in the RGB space and 20 for the joint 6D 
space. Modes with support less than 10% of the neighbourhood were considered 
insignificant and therefore ignored. Low intensity modes (less than 5 percent 
of the luminance scale) were also not taken into account to improve stability 
especially in the case of relative colour feature matching. Although ratios from 
pixels with saturated (clipped) colours are not expected to be stable, we did not 
remove saturated colours for the reported experiments. The matching threshold 
was also fixed and was dependent only on the nature of the features used. For 
example, for $RGB^2$ feature matching, the matching threshold was fixed to 100 for 
the proposed distance function (5). 

4.2 Object Recognition Experiment 



To compare MNS performance with results reported in the literature, we perfor- 
med a well known colour object recognition experiment using a dataset collected 
by M. Swain. The database is publicly available Pj and has been used in a number 
of colour recognition experiments (e.g. [2419121 p . The model image set consisted 
of 66 household objects imaged on black background under the same light (for a 
full colour image of the database see 123 ) . The test set consisted of 32 images, 
a subset of model objects rotated, displaced or deformed (e.g. clothes). The test 
database and the corresponding model objects are shown in Fig. 6. 

MNS performance evaluation was identical to Funt and Finlayson’s 0 where 
ratio histogram matching was used for recognition. However, for that experiment, 
11 model and 8 test images were removed from the database due to saturated 
pixels whereas we used all images. The same experiment was repeated by Park et 
al. |2 1 j using a colour adjacency graph representation of image colour structure. 

Computation of each MNS signature took 0.1 seconds on average. Image size 
was 128 X 90 pixels for both the model and test image sets. The average signature 
size was 150 bytes m- No image preprocessing, subsampling or smoothing was 
applied before signature computation. All internal parameters (mean shift kernel 
width, neighbourhood size etc) were exactly the same as those used for the image 
retrieval experiment. 



5 Results 

5.1 Image Retrieval 

We first report results on the retrieval task from a database of sport images 
described in Section 4.1. Database images were matched to the query image (Fig. 3) 
and sorted by their similarity to the query. Performance was evaluated according 
to the percentage of relevant images that were retrieved in the top 20 ranks of 
the retrieved list. In general, retrieval was very fast. A single signature match 
score was computed in approximately 1.5 ms i.e. the retrieval proceeds at 600 
matches/sec on average. The results are presented in Fig. 8 as plots of percentage 
of relevant images as a function of the number of retrieved images. 

Retrieval results varied depending on the feature representation used. Assu- 
ming constant illumination for all images, 6-dimensional feature matching was 
applied which resulted in 12 out of 13 Irish images being in the top 20 ranks, a 
hit rate of 92.3%. However, the remaining image was ranked 83. The same retrie- 
val experiment was repeated using the 4-dimensional chromaticity vectors. Hit 
rate was 61.5% but the worst rank was 53. We repeated the retrieval experiment 
for 3-dimensional ratio feature matching using only simple ratios as described in 
Section 3.3. The worst match was 46 but hit rate was only 30.7%. 
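
The hit rate used in this evaluation is the fraction of relevant images ranked within the first 20 retrieved images. A small sketch follows; the function name `hit_rate` is hypothetical and the individual rank values are illustrative placeholders, since only the counts and the worst rank are reported in the text.

```python
def hit_rate(relevant_ranks, cutoff=20):
    """Fraction of relevant images whose rank is within the top `cutoff`."""
    hits = sum(1 for r in relevant_ranks if r <= cutoff)
    return hits / len(relevant_ranks)

# 13 relevant images: 12 inside the top 20 (placeholder ranks) and one
# outlier at rank 83, as reported for the 6-dimensional features.
ranks_6d = list(range(1, 13)) + [83]
rate_6d = round(100 * hit_rate(ranks_6d), 1)
```

With 12 of 13 targets inside the cutoff the sketch reproduces the reported 92.3%; the measure says nothing about how far down the list the remaining image falls, which is why the worst rank is quoted separately.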

When matching based on absolute colour values, the colour constancy problem 
is apparent. An image with the query object of similar size (Fig. 4(c)) 
had a very low similarity score due to the significant change in the colour of 
illumination. Matching illumination invariant features, like the chromaticities or 

(a) Test objects used for the recognition experiment 




(b) Model objects corresponding to the tests in (a) 



Fig. 6. Sample test and model images from Swain’s database 



the ratios mentioned above, improved performance assigning a high rank to the 
previously missed image. Although the hit rate was not so good as in the first 
case, all relevant images were retrieved within a smaller subset of the retrieved 
image set. Clearly, there is a trade-off between the hit rate as defined above and 
the invariance to the illumination change. In the case where illumination colour 
was indeed constant, the higher dimensionality features benefited from their 
higher discriminative power. However, in changing illumination conditions, only 
relative invariant features were able to correctly rank the images even though 
discrimination was not as good. 

(a) (b) (c) (d) 



Fig. 7. Examples of Swain’s model images with very similar red-white regions: (a) clam 
chowder can, (b) chicken soup can, (c) mickey underwear and (d) red-white jumper 




(a) 



(b) 



(c) 



Fig. 8. Retrieval results for different colour invariant features using : (a) 6D RGB^ 
features (b) 4D chromaticity features (c) 3D ratio features 




(a) (b) (c) (d) 

Fig. 9. Neighbourhoods that matched with the Irish flag query : (a) stripes on jumper 
(b) deformed flag in crowd (c) small flag on cap (I) (d) small flag on cap (II) 



5.2 Colour Object Recognition 

Assuming that illumination was kept approximately constant for all images in 
Swain’s database the multimodal neighbourhood signature was tested using 6D 
$RGB^2$ feature matching. For each test object, signature dissimilarity from 66 
model signatures (of the models described in Section 4.2) was computed and the 
rank of the correct pair stored. For a single test object the recognition process 

Table 1. Comparative colour object recognition results for Swain’s database 

Method                Rank 1   Rank 2   Rank 3   Rank >3   Average match   Number of test/model 
                                                           percentile      images 
MNS                   27       2        2        1         0.995           32 / 66 
CC Colour Indexing    22       2        0        0         0.998           24 / 55 
Colour Indexing       29       3        0        0         0.999           32 / 66 
Hybrid graph          32       0        0        0         1.000           32 / 66 


took 0.28 seconds, i.e. 4 msec per image match. Speed was still real-time although 
slower than retrieval since the models were more complex than the Irish 
flag query image. To allow comparison with previous experiments, recognition 
performance of the algorithm was assessed in terms of the average match percentile. 
The match percentile for each image matched is defined as $\frac{N - r}{N - 1}$ where $N$ 
is the number of model images and $r$ is the rank of the model image containing 
the test object. 
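
As a worked check, the sketch below assumes the usual definition of the match percentile from the colour indexing literature, $(N - r)/(N - 1)$, and recomputes the averages of those rows of Table 1 whose rank counts determine them completely. The function names are hypothetical.

```python
def match_percentile(rank, n_models):
    """Assumed definition: (N - r) / (N - 1); 1.0 for rank 1, 0.0 for rank N."""
    return (n_models - rank) / (n_models - 1)

def average_match_percentile(ranks, n_models):
    """Mean match percentile over all test images."""
    return sum(match_percentile(r, n_models) for r in ranks) / len(ranks)

# Rank counts taken from Table 1 (no test image ranked worse than 2 here).
colour_indexing = [1] * 29 + [2] * 3        # 32 tests, 66 models
cc_colour_indexing = [1] * 22 + [2] * 2     # 24 tests, 55 models

avg_ci = round(average_match_percentile(colour_indexing, 66), 3)
avg_cc = round(average_match_percentile(cc_colour_indexing, 55), 3)
```

Both values agree with the table (0.999 and 0.998). The MNS row cannot be recomputed exactly in the same way because the rank of the single image listed under ">3" is not stated.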

Results are presented in Table 1. Recognition performance is compared with 
reported results for the colour indexing 121 , colour constant (CC) colour inde- 
xing 13 and hybrid graph m representations respectively. Recognition using 
the MNS compared favourably to the other three algorithms with an average 
match percentile of 99.5% using the default MNS parameters. 

The objects that were not classified as rank 1 include mainly objects with 
red-white colour boundaries (e.g. Fig. 7). Such objects are common in Swain’s 
database and their MNS signatures are similar. 

Histograms record areas (or relative areas if normalised) and have no pro- 
blems discriminating between objects with almost identical colours but with 
different sizes of colour region. For Swain’s database this property is beneficial, 
since most objects undergo only rotations and translations and have approxima- 
tely the same scale. Consequently, MNS is outperformed, although the difference 
seems insignificant. In the presence of occlusion, object deformation or general 
viewpoint change (e.g. as in the image retrieval experiment above) reliance on 
non-invariant and/or global properties like area or relative area will negatively affect 
performance. The best reported result for this dataset was achieved by the 
hybrid colour adjacency graph which incorporates information about the spatial 
arrangement of colours in the image. Although the MNS representation allows 
for localisation of matching regions, we did not demonstrate this feature in the 
reported experiments. 

6 Conclusions 

In this paper, a novel approach to colour-based object recognition and image retrieval, 
the Multimodal Neighbourhood Signature (MNS), was presented. The 
proposed method directly formulates the problem of representing object colour 
appearance by computing signatures of colour features derived from robust estimates 
of the modes of a local colour density function. From a multimodal neighbourhood 
signature, a number of invariants were computed to address changes 
in the imaging conditions within the application environment. In addition, by 
computing features from image neighbourhoods, the MNS method facilitates 
region-based query specification and image retrieval. 

We demonstrated our algorithm’s performance on a region-based image re- 
trieval task and a good (92%) hit rate was achieved in real time (600 image 
matches/sec on a SUN Ultra Enterprise 450 with quad 400MHz UltraSPARC- 
II CPUs). Relevant images were successfully retrieved regardless of background 
clutter, partial occlusion or non-rigid object deformation. In particular, very 
small regions were successfully matched like the small Irish flags on the swimmer’s 
caps (Fig. 9). In addition, the trade-off between hit rate and illumination 
invariance was apparent in the reported experiments. Regarding colour object 
recognition, the MNS representation was tested on a standard dataset and com- 
pared favourably with three well known recognition algorithms. Very good per- 
formance (average match percentile 99.5%) was achieved with default settings, 
identical to those used in the image retrieval experiment. In general, the MNS 
signatures were concise and thus significant data reduction was achieved. An 
image was typically represented by a few hundred bytes, a few thousands for 
very complex scenes. 

Future improvements to the algorithm include introducing a training/lear- 
ning stage to efficiently exploit discriminative colour characteristics inherent 
to the database at hand, and a multiscale approach to compensate for scale 
changes. Selection of an appropriate distance for colour invariants, especially 
those taking the form of a ratio, should be investigated. Finally, we intend to 
study the potential of multimodal neighbourhoods with more than two modes 
for recognition and retrieval. 

Abstract. This paper introduces a new multi-view reconstruction problem called 
approximate N-view stereo. The goal of this problem is to recover a one- 
parameter family of volumes that are increasingly tighter supersets of an unknown, 
arbitrarily- shaped 3D scene. By studying 3D shapes that reproduce the input pho- 
tographs up to a special image transformation called a shuffle transformation, we 
prove that (1) these shapes can be organized hierarchically into nested supersets of 
the scene, and (2) they can be computed using a simple algorithm called Approxi- 
mate Space Carving that is provably-correct for arbitrary discrete scenes (i.e., for 
unknown, arbitrarily-shaped Lambertian scenes that are defined by a finite set of 
voxels and are viewed from N arbitrarily-distributed viewpoints inside or around 
them). The approach is specifically designed to attack practical reconstruction 
problems, including (1) recovering shape from images with inaccurate calibration 
information, and (2) building coarse scene models from multiple views. 



1 Introduction 

The reconstruction of 3D objects and environments from photographs is becoming a key 
element in many applications that simulate physical interaction with the real world (e.g., 
[1]). Unfortunately, despite significant recent progress on the topic of N-view stereo [1- 
9], there are many practical reconstruction problems for which a general solution is 
beyond the current state of the art. Examples include (1) reconstructing an unknown 
scene from images with inaccurate calibration information [10, 11], (2) reconstructing 
a scene that is not perfectly stationary (e.g., a person that moved slightly between each 
snapshot [11]), and (3) recovering a coarse scene model either for efficiency reasons 
[12, 13], or because the scene’s geometry is exceedingly complex (e.g., a tree full of 
leaves, viewed from a distance). 

As a first step toward addressing this limitation, in this paper we consider a new 
multi-view reconstruction problem called approximate N-view stereo. The goal of this 
problem is to recover a one-parameter family of volumes that are increasingly tighter 
supersets of the true scene. Working from first principles, we show that each of the above 
reconstruction tasks can be thought of as instances of the approximate N-view stereo 
problem and, as such, they can be solved by recovering approximate, rather than exact, 
3D shapes from multiple views. Moreover, we provide a detailed geometrical analysis 
of the approximate N-view stereo problem and describe a computational framework 
for provably solving it for discrete scenes in the general case (i.e., for an unknown, 
arbitrarily-shaped Lambertian scene that is viewed from N arbitrarily-distributed viewpoints 
inside or around it). Experimental results illustrating the applicability of our 
theoretical development are given for real scenes of complex geometry. 

D. Vernon (Ed.): ECCV 2000, LNCS 1842, pp. 67-83, 2000. 

Our approach is motivated by the general question of how to recover 3D scene 
approximations from multiple views. We argue that any answer to this question should 
be evaluated according to the following criteria: 

- Direct recovery: The approximation should be computable directly from the input 
images, i.e., without first computing an exact reconstruction. 

- Generality: Computations should rely as little as possible on assumptions about 
the distribution of the input viewpoints (e.g., nearby views or minimal occlusion), 
about the scene’s true shape, or about the existence of specific image features in the 
input views (e.g., edges, points, lines, contours, texture, or color). 

- Level-of-detail control: It should be possible to control the degree to which the 
recovered shape approximates the shape of the true scene (to the extent allowed by 
the image data). 

- Shape determinism: It should be possible to quantify the geometric relationship 
between the recovered approximation and the shape of the true scene. 

- Reconstructibility conditions: It should be possible to state the conditions under 
which the approximation process is valid and/or breaks down. 

- Robustness: Approximate reconstruction should be possible even when ideal stereo 
conditions (e.g., good calibration, scene rigidity) are not satisfied. 

Unfortunately, serious difficulties arise when attempting to fulfill the above criteria using 
existing approaches. 

First, current theories for 3D shape approximation (e.g., scale space [14], mesh 
simplification [15], wavelet descriptions [16], and hierarchical volume representations 
[17]) define the notion of an “approximate 3D shape” in terms of the exact geometry 
of the shape being approximated. These theories are therefore not directly applicable 
when exact reconstructions are unavailable. 

Second, even though recent approaches to N-view stereo have demonstrated good 
results even under very general conditions about the scene and the input viewpoints 
[5, 6, 8], they cannot be easily extended to achieve approximate reconstruction. Key to 
their success is the use of provably-correct algorithms to determine the scene points 
visible in each view. Little is known, however, about whether visibility determination 
is well defined and tractable when recovering approximate shapes from images. This 
is because even a small deviation from the shape of the true scene can imply dramatic 
changes in visibility [18], and because the input images will not be consistent, in general, 
with point visibilities defined by an approximate shape. 

Third, existing stereo methods rely on the assumption that under ideal conditions, 
every point on a reconstructed 3D shape should project to points that can be matched 
across views. Unfortunately, this assumption cannot be used as a basis for designing 
approximate shape recovery algorithms because it breaks down in that case. Hence, even 
though recent particle- and mesh-based stereo techniques allow level-of-detail control 
[9, 11, 19, 20], their reliance on regularization criteria that enforce this assumption makes 
it impossible to analyze their behavior during approximate reconstruction (e.g., existence 
of local minima, convergence properties, shape determinism). 

Fourth, existing attempts to recover approximate shapes from multiple views rely 
exclusively on a two-step method that involves (1) reducing the resolution of the input 
views, and (2) recovering low-resolution shapes from the reduced images [12, 13]. In 
general, it is impossible to guarantee that low-resolution pixels spanning large depth or 
color discontinuities will be matched across views [12]. Approaches that rely on this 
method are therefore largely heuristic. 

Fifth, with the exception of [11], no previous techniques have recognized the tight 
inter-dependence between the problem of approximate reconstruction and that of recon- 
struction in the presence of calibration errors. As a result, only partial studies exist for 
handling these problems.' 

In order to overcome the limitations of existing approaches, our work is based on 
a simple idea: rather than define the notion of an “approximate 3D shape” directly in 
terms of 3D geometry, we define it indirectly, in terms of its appearance. To make this 
idea concrete, we use a class of image transformations suitable for describing a process 
of “controlled image approximation,” and define approximate 3D shapes to be volumes 
in space that reproduce the input photographs up to a transformation in this class. 

The two crucial questions one must address to exploit this idea are (1) how to 
relate these implicitly-defined approximate shapes to the geometry of the true scene, 
and (2) how to recover such shapes from images of arbitrarily-shaped scenes. Here 
we show that both questions can be answered with the help of a special class of image 
transformations called shuffle transformations. These transformations describe arbitrary 
bounded re-arrangements of points in an image and have a unique property — the views of 
the true scene are always related to the views of its supersets by a shuffle transformation. 
Moreover, we can use these transformations to arrange a scene’s supersets into a one- 
parameter family of nested volumes, called the Photo-Consistency Scale Space, whose 
appearance converges to the input views and whose shapes provide increasingly tighter 
bounds on the true scene. Importantly, we show that shapes from this scale space can 
be computed using a simple, efficient and provably-correct volumetric algorithm that 
fulfills our stated evaluation criteria to a great degree. To our knowledge, this is the only 
algorithm with this property. 

In the following we consider a scene to be an arbitrary, bounded, and opaque volume 
$V$ that is viewed from arbitrary viewpoints in $\Re^3 - V$. To simplify our exposition, we first 
focus on the case where volumes are not finite, i.e., they are open subsets of $\Re^3$ bounded 
by closed surfaces [21]; where images are functions of color or intensity defined over a 
continuous domain $[0, H] \times [0, W]$; where pixels are infinitesimally-small points; and 
where no pixel noise is present. We then consider finite volumes in Section 5 and discrete 
images in Section 6 and in the Appendix. 



' For instance, the shape distortion analysis in [10] cannot be extended from 2 to N 
erroneously-calibrated views because no single 3D shape will be consistent, in general, 
with N such views; in such cases, only approximate shapes can be recovered. Similarly, 
the coarse-to-fine reconstruction method of Prock and Dyer [12] still requires accurate calibration. 

Figure 1. Shuffle transformations. (a) A 1-shuffle corresponding to a piecewise-continuous image 
translation. (b) A 1-shuffle corresponding to a non-parametric image transformation; this example 
shows that r-shuffles can model the process of “ignoring” a subset of the pixels in an input image. 
(c),(d) A randomized 10-shuffle: the image in (d) was created by displacing every pixel in (c) 
to a randomly-selected position inside a 21x21 pixel window centered at the pixel. Note that (d) 
appears “blurred” even though no modification of colors or intensities has taken place. 



2 Approximate N-View Stereo 

A basic step in our method for recovering approximate 3D shapes is to apply in a novel 
way the principle of transformation-invariant reconstruction [22]: given a collection of 
input photographs, recover a 3D shape that is defined up to an a priori specified class 
of transformations. Below we apply this principle in an appearance based way by (1) 
defining the class of shuffle image transformations, and (2) defining approximate recon- 
struction as the problem of recovering a volume that reproduces the input photographs 
up to a shuffle transformation. 

2.1 Shuffle Transformations 

We use the term shuffle transformation to denote any image transformation (continuous 
or otherwise) that causes the bounded repositioning of pixels in an image. Shuffle 
transformations are defined implicitly, in terms of their effect on a source image: 

Definition 1 (r-Shuffle) A 2D transformation $T : I_1 \to I_2$ is called an r-shuffle if for
every point in image $I_2$ we can find a point of identical color within a disk of radius r
in $I_1$. The constant r > 0 is called the dispersion radius of T.

Shuffle transformations affect only the arrangement of pixels in an image, not their
actual colors or intensities (Figure 1a). Because these transformations encompass a
wide range of distortions, algorithms that can recover shape from images known up
to an r-shuffle are by definition invariant to bounded parametric, non-parametric, and
statistically-defined image distortions (Figure 1b-d).
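
Definition 1 can be made concrete with a short sketch. The Python code below is an illustration only, not the implementation used in this paper: the function names (`disk_offsets`, `random_r_shuffle`, `is_r_shuffle`) are hypothetical, images are assumed to be NumPy arrays on an integer pixel grid, and the random shuffle is built in "gather" form (each output pixel copies a source pixel from within the disk) so that it satisfies the definition by construction.

```python
import numpy as np

def disk_offsets(r):
    """Integer offsets (dy, dx) that lie inside a disk of radius r."""
    k = int(np.floor(r))
    return [(dy, dx)
            for dy in range(-k, k + 1)
            for dx in range(-k, k + 1)
            if dy * dy + dx * dx <= r * r]

def random_r_shuffle(src, r, rng):
    """Build an image in which every pixel copies the color of a randomly
    chosen source pixel at distance <= r (assumed 'gather' construction)."""
    h, w = src.shape[:2]
    offs = np.array(disk_offsets(r))
    pick = offs[rng.integers(len(offs), size=(h, w))]  # shape (h, w, 2)
    # Clipping at the border only moves the source pixel closer.
    ys = np.clip(np.arange(h)[:, None] + pick[..., 0], 0, h - 1)
    xs = np.clip(np.arange(w)[None, :] + pick[..., 1], 0, w - 1)
    return src[ys, xs]

def is_r_shuffle(src, dst, r):
    """Check Definition 1: every pixel of dst has an identically colored
    pixel within a disk of radius r in src."""
    h, w = dst.shape[:2]
    found = np.zeros((h, w), dtype=bool)
    for dy, dx in disk_offsets(r):
        ys = np.arange(h)[:, None] + dy
        xs = np.arange(w)[None, :] + dx
        valid = (ys >= 0) & (ys < h) & (xs >= 0) & (xs < w)
        cand = src[np.clip(ys, 0, h - 1), np.clip(xs, 0, w - 1)]
        same = cand == dst
        if same.ndim == 3:          # color image: all channels must agree
            same = same.all(axis=-1)
        found |= same & valid
    return bool(found.all())
```

Note that the check only constrains where colors come from, never their values, which is why a shuffled image can look blurred although no color has been modified.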

2.2 Reconstruction Using Photo-Consistency 

Before we can make precise the approximate reconstruction problem, we need to relate 
the set of input views of an unknown scene to the 3D reconstructions they give rise to 
in the ideal case, i.e., when an exact 3D reconstruction is sought. We use the photo-consistency
theory of Kutulakos and Seitz [6] for this purpose, briefly summarized
below. 

The notion of photo-consistency is based on the idea that a reconstructed 3D shape 
should reproduce a scene’s input photographs exactly if it is to be considered a valid 
geometric description of that scene. This leads to a geometric constraint satisfaction 
problem, where every input photograph can be thought of as a constraint that restricts 
the space of all possible 3D shapes to only those shapes that could have possibly 
produced that photograph. When many such photographs are available, each taken from 
a known position $c_i$, they define an equivalence class of 3D shape solutions called
photo-consistent shapes whose views are identical to the input photographs when viewed from
the photographs’ viewpoints (Figure 2a): 

Definition 2 (Photo-Consistent 3D Shape) An arbitrary finite and opaque volume V
is photo-consistent if there is an assignment of radiances (colors) to every point on its
surface such that V’s projection along the known viewpoints $c_1, \ldots, c_N$ is identical to
the corresponding input photographs.

Using this definition as a starting point, photo-consistency theory studies the recon- 
struction of photo-consistent shapes from N arbitrarily-distributed views that are taken 
from known positions in space. In particular, Kutulakos and Seitz proved the following: 

Theorem 1 (Photo Hull Theorem) For every volume V that contains the true scene, 
there exists a unique shape, called the Photo Hull, that is the union of all photo-consistent 
subsets of V. Moreover, this shape contains the true scene and is itself a photo-consistent 
shape. 

The notion of the photo hull plays a key role in our approach for three reasons. First, 
it provides a direct mathematical link between a scene’s appearance in N images and 
the reconstruction(s) these images can give rise to. Second, it defines in an algorithm- 
independent way the tightest possible superset of the true scene that we can ever recover 
from N photographs, in the absence of a priori scene information. This is because it is 
impossible to decide, based on the photographs alone, which photo-consistent subset of 
this maximal shape is the true scene. The notion of the photo hull is therefore especially 
important in order to evaluate the results of our approximate reconstruction approach. 
Third, it leads to a simple, efficient, and provably-correct volumetric algorithm for 
computing this shape — starting from an arbitrary superset V of the scene itself (e.g., an 
arbitrary “bounding box” that surrounds a physical 3D object), the algorithm iteratively 
“carves” voxels away from this superset until the carved volume converges to the 
photo hull. As such, photo-consistency theory leads to algorithms that satisfy both the 
Generality and Shape Determinism criteria posed in Section 1. 



2.3 Approximation Using Shuffle Transformations 

Figure 2. (a) Photo-consistent shapes. The color and intensity at the projection of every point on
their surface must be identical to the input photographs. (b) r-consistent shapes. The color and
intensity at the projection of every point on their surface must be identical to that of a pixel within
a disk of radius r in the input photographs (shown in gray). Panel labels: radiance assignment,
projection, photo-consistency, r-consistency.

Unfortunately, despite its useful features, photo-consistency theory relies on precise
knowledge of the input viewpoints and provides no mechanism for approximate shape
recovery. We therefore relax the definition of a photo-consistent 3D shape in a way that
makes 3D shape approximation a mathematically tractable problem:

Definition 3 (r-Consistent 3D Shape) A volume V is r-consistent if for every input
photograph $I_i$ there exists an r-shuffle $T_i : I_i \to I_i'$ that makes V photo-consistent with
the photographs $I_1', \ldots, I_N'$.

r-consistent shapes are the central concept in approximate N-view stereo. Intuitively,
these shapes satisfy two seemingly-incompatible requirements. On one hand, a
valid 3D scene approximation must be globally consistent with the input photographs, 
i.e., every point on its surface must conform to the textures, discontinuities, and occlu- 
sion relationships captured by the entire set of the input views. On the other, such an 
approximation cannot reproduce the input views exactly since it only approximates the 
scene itself. 

When a point on an r-consistent shape is projected into a pair of photographs in 
which the point is visible, it induces an implicit correspondence between a pair of disks 
(Figures 2b, 3a). This correspondence can be interpreted in terms of a simple criterion 
for matching two or more sets of pixels: 

Definition 4 (Color Equivalence of Pixel Sets) Two pixel sets are color equivalent if 
there is a pixel color that appears in both. 

r-consistent shapes can therefore be thought of as integrating into a global, N- 
view stereo framework elements from recent non-parametric [23-25] and robust [26] 
approaches to image matching. 
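
As a minimal sketch of Definition 4, extended to k pixel sets as the carving algorithm requires, exact color equivalence reduces to a set intersection. The function name `color_equivalent` is hypothetical, and pixels are assumed to be hashable color tuples; with noisy images an exact-equality test like this one is brittle, which is why a thresholded variant is used in the Appendix.

```python
def color_equivalent(*pixel_sets):
    """Return True if some color appears in every one of the pixel sets."""
    if not pixel_sets:
        return False
    common = set(map(tuple, pixel_sets[0]))
    for pixels in pixel_sets[1:]:
        common &= set(map(tuple, pixels))
        if not common:          # early exit: no shared color is possible
            return False
    return bool(common)
```

For example, `color_equivalent([(1, 2, 3), (4, 5, 6)], [(4, 5, 6)])` holds because the color `(4, 5, 6)` appears in both sets.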

3 The Photo-Consistency Scale Space 

Our definition of r-consistency leads directly to a hierarchy of 3D scene approxima- 
tions in which the dispersion radius, r, controls how well the appearance of the 3D 
approximation matches that of the true scene. To use this hierarchy as a framework for
developing reconstruction algorithms, however, we must establish the 3D relationship 
between an r-consistent shape and the true scene. 

We make this relationship precise with the following theorem which suggests that 
the dispersion radius can be thought of as a “scale” parameter that controls the detail in 
an r-consistent shape. Theorem 2 shows that the relationship between representations 
of the same scene at different scales is one of containment, i.e., r-consistent shapes at 
coarse scales are guaranteed to contain counterparts at finer scales (Figure 3a): ^ 

Theorem 2 (Nesting Theorem) Let V be an arbitrary bounded superset of the scene:

1. 0-consistency is equivalent to photo-consistency;

2. r-consistency implies $r_1$-consistency for every $r_1 > r$;

3. every superset of the scene is r-consistent for some r > 0;

4. if $V_r \subset V$ is an r-consistent shape that contains the scene, then for every $r_1 > r$ we
can find an $r_1$-consistent subset of V; equivalently, for every $0 < r_2 < r$, we can
find an $r_2$-consistent superset of the scene that is contained in $V_r$.

The Nesting Theorem has three implications for approximate shape recovery. First, 
it suggests that the recovery of a photo-consistent or an r-consistent shape from pho- 
tographs can always be formulated as a “coarse-to-fine” reconstruction process in which 
r-consistent shapes of increasingly smaller dispersion radius are recovered at each step, 
starting from an arbitrary initial volume that bounds the scene. Second, it shows that 
this process provides a way to reconstruct “controlled” scene approximations that (1) 
always bound the true scene, and (2) provide an increasingly accurate representation of 
both a scene’s 3D shape and of its appearance. Third, it suggests that we can establish an 
explicit bound on a scene’s 3D shape even from images that do not correspond to exact 
views of a rigid 3D scene. This is because the images of an r-consistent shape are defined 
only up to a shuffle transformation, and hence, any collection of r-shuffle-transformed 
views of a rigid scene are sufficient to determine such a shape. 

4 Reconstruction Using Free-Space Queries 

Our goal is to recover r-consistent shapes from multiple views of an arbitrary scene. To 
do this, we exploit the Nesting Theorem by repeatedly applying to an arbitrary superset 
of the scene an operation akin to “space carving”: given an $r_1$-consistent superset with
$r_1 > r$, remove selected portions of this volume so that the resulting “carved” shape
becomes r-consistent. In this section, we show how to perform this carving operation 
with the help of a provably-correct criterion for testing whether a portion of a non 
r-consistent volume is completely devoid of scene points. 

The key observation we exploit to define this criterion is that even though r- 
consistency was defined in a purely appearance-based way, the r-consistency and non 

^ The reader should note that our approach differs from existing scale space theories (e.g., [14]) 
in two significant ways: (1) unlike previous formulations, the mapping between r-consistent 
shapes at different scales is not continuous in general, and (2) this mapping is defined in terms 
of the shapes’ appearance rather than their geometry. 
r-consistency of a shape provides explicit 3D geometric information about the underlying
scene. In particular, let V be an arbitrary non r-consistent superset of the scene.
By its definition, V must contain a point p on its surface such that (1) p is visible from
a subset $\{c_1, \ldots, c_k\}$ of the input viewpoints,^ and (2) the disks, $D_1, \ldots, D_k$, that
are centered at p’s projection in the corresponding photographs are not color equivalent.
Let $\mathcal{R}_i$, $i = 1, \ldots, k$, be the interior of the conical volumes defined by $c_i$ and $D_i$,
respectively. We use the following theorem (Figures 3c, d):

Theorem 3 (Free-Space Query Theorem) (1) If the volume $\mathcal{R}_p = \bigcap_{i=1}^{k} \mathcal{R}_i$ is not
occluded by $V - \mathcal{R}_p$ from any of the viewpoints $c_i$, $i = 1, \ldots, k$, $\mathcal{R}_p$ is free
of scene points. (2) If $\mathcal{R}'_p$ is the subset of $\mathcal{R}_p$ that is unoccluded by $V - \mathcal{R}_p$ from
$c_i$, $i = 1, \ldots, k$, the volume $\mathcal{R}'_p \cap V$ is free of scene points.

Theorem 3 gives a deterministic sufficiency criterion for “querying” the free space 
inside a non r-consistent volume by simply comparing disks around the projection of a 
single point on the volume’s surface: 

Corollary 1 (Free-Space Query Criterion) If the disks $D_1, \ldots, D_k$ are not color equivalent,
there exists an identifiable volume $V_p \subseteq V$ that contains no scene points.

5 The Approximate Space Carving Algorithm 

The Free Space Query criterion leads directly to a simple volumetric reconstruction 
algorithm that, given a dispersion radius r and an arbitrary superset V of the scene, 
iteratively removes portions of that volume until it becomes r-consistent. 

In particular, the Free-Space Query Theorem tells us that if there is a point p on V’s surface
that satisfies this criterion, the volume $V' = V - V_p$ must still contain the scene.
Furthermore, the Nesting Theorem guarantees the existence of an r-consistent superset
of the scene in the volume V'. Hence, if the scene consists of a finite set of points (e.g.,
voxels) and only the points in $V_p$ are removed at each iteration, the carved volume will
converge to an r-consistent shape. These considerations lead to the following algorithm
for computing an r-consistent shape:

Approximate Space Carving Algorithm

Step 1: Initialize V to a superset of the scene.

Step 2: Repeat the following until no surface point p can be selected in Step 2b:

a. Project p to all viewpoints $c_1, \ldots, c_j$ in which it is visible and let $D_1, \ldots, D_j$ be
disks of radius r around p’s projection in the corresponding photographs.

b. Select p if no single pixel color appears in all disks.

Step 3: Let $V_p$ be the largest volume in V that contains p, is fully visible in $c_1, \ldots, c_j$,
and projects to the interior of $D_1, \ldots, D_j$, respectively.

Step 4: Set $V = V - V_p$, and continue with Step 2.
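
The control flow of the steps above can be sketched for a voxel set as follows. This is a simplified illustration under stated assumptions, not the implementation evaluated in Section 7: the callbacks `surface_points`, `visible_views`, and `disk_colors` are hypothetical placeholders for the camera geometry, and the carved volume is conservatively taken to be the single voxel p rather than the largest admissible volume of Step 3.

```python
def approximate_space_carving(volume, surface_points, visible_views,
                              disk_colors, r):
    """Carve voxels until every surface voxel passes the color test.

    volume          -- iterable of hashable voxels (initial superset, Step 1)
    surface_points  -- hypothetical callback: volume -> iterable of surface voxels
    visible_views   -- hypothetical callback: (p, volume) -> list of view ids
    disk_colors     -- hypothetical callback: (view, p, r) -> colors in the
                       disk of radius r around p's projection in that view
    """
    volume = set(volume)
    changed = True
    while changed:                      # Step 2: repeat until nothing is selected
        changed = False
        for p in list(surface_points(volume)):
            views = visible_views(p, volume)
            if len(views) < 2:
                continue                # a single disk is trivially equivalent
            disks = [set(disk_colors(v, p, r)) for v in views]   # Step 2a
            if not set.intersection(*disks):                     # Step 2b
                volume.discard(p)       # Steps 3-4 with the region reduced to {p}
                changed = True
                break                   # visibility may change; rescan the surface
    return volume
```

Removing only p is a subset of what Step 3 allows, so the result still contains the scene whenever the callbacks are correct; it simply takes more iterations.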



^ More formally, we consider $p \in V$ to be visible to a set of cameras $\{c_1, \ldots, c_k\}$ if there exists
a volume $V' \subset V$ around p whose occluders lie entirely in $V'$. That is, for every $q \in V'$, the
open line segment $q c_i$ does not intersect $V - V'$ for any camera $c_i$ in the set.
Figure 3. (a) Illustration of r-consistency for a shape V (shown in yellow) that contains the true
scene. The scene’s volume is shown in black. If the distance between the projection of p and q
in the right-hand image is d, V will be r-consistent for every r > d/2. (b) Illustration of the
Nesting Theorem. The blue volume represents the Photo Hull. (c)-(d) Illustrations of the Free-Space
Theorem. (c) The intersection $\mathcal{R}_p = \mathcal{R}_1 \cap \mathcal{R}_2$ is unoccluded by V; the volume $V_p$, shown
in dark blue, must be free of scene points: if it were not, at least one scene point would be visible
to both cameras, forcing the color equivalence of the disks $D_1, D_2$. (d) The intersection $\mathcal{R}_1 \cap \mathcal{R}_2$
is occluded by V; the volume $V_p$ is restricted to the intersection of the visible portion of $\mathcal{R}_p$.




Figure 4. (a)-(b) Illustration of the Calibration Reconstructibility Theorem. (a) The initial volume
V, shown in yellow, is a bounding box of the scene. Yellow cameras indicate the true viewpoints,
$c_i$. If the incorrect viewpoints $\hat{c}_i$, shown in red, do not cross the dotted lines, the volume V
contains a reconstructible r-consistent shape. The theorem also provides an easily-computable
reconstructibility test: if V itself is not r-consistent for any r, we cannot use the Approximate
Space Carving Algorithm to reconstruct the scene. (b) Another application of the theorem. The
circles around each camera are a priori bounds on the calibration error of the input views. An
r-consistent shape can be recovered from the circular initial volume, V, by allowing the red
camera to contribute to the carving of V only around surface points outside the red regions, i.e.,
points whose visibility is not affected by calibration errors. (c)-(d) Handling discrete images. (c)
The difference between the values of the corresponding red pixels, whose footprints contain no
irradiance discontinuities, tends to zero with increasing image resolution. On the other hand, even
though the middle pixels in the left and right image form a corresponding stereo pair, their actual
values will be arbitrary mixtures of red and green and their similarity cannot be guaranteed for any
image resolution. (d) Relating a discrete image to a 2-pixel-shuffle of its continuous counterpart.
For every pixel in the discrete image whose footprint does not span a discontinuity, there is a point
in the continuous image that has identical value and falls inside the footprint (circle). Hence, we
obtain a 2-pixel-shuffle of the continuous image by replacing the value of every corrupted pixel
in the discrete image by that of an adjacent pixel that does not lie on a discontinuity.






76 K.N. Kutulakos 



6 Applications of the Theory 

Reconstruction in the Presence of Calibration Errors. Because r-consistent reconstruction
is invariant to shuffle transformations of the input images, it leads to algorithms
that operate with predictable performance even when these views are not perfectly calibrated.
More specifically, let V be an arbitrary superset of the true scene, let $c_1, \ldots, c_N$
be the viewpoints from which the input photographs were taken, and let $\hat{c}_1, \ldots, \hat{c}_N$ be
incorrect estimates for these viewpoints, respectively.

Theorem 4 (Calibration Reconstructibility Theorem) An r-consistent subset of V
exists for some r > 0 if the following condition holds for all i = 1, . . . , N:

$$\mathrm{Vis}_V(c_i) = \mathrm{Vis}_V(\hat{c}_i), \qquad (1)$$

where $\mathrm{Vis}_V(c)$ is the set of points on V that are visible from c.

Theorem 4 tells us that Approximate Space Carving can provably handle arbitrarily 
large calibration errors, as long as these errors do not affect the visibility of the initial 
volume given as input to the algorithm. This result is important because the conditions 
it sets are independent of the true scene — they only depend on the shape V, which is 
known in advance (Figure 4).^ In practice, even though large calibration errors may
allow reconstruction of an r-consistent shape, they affect the minimum r for which 
this can happen. Hence, good information about a camera’s calibration parameters is 
still needed to obtain r-consistent shapes that tightly bound the true scene. A key open 
question is whether it is possible to employ the approximate correspondences defined 
by an r-consistent shape to achieve self-calibration, i.e., improve camera calibration 
and recover even tighter bounds on the true scene [11, 27].

Reconstruction from Discrete Images. In practice, pixels have a finite spatial extent
and hence their color is an integral of the image irradiance over a non-zero solid 
angle. This makes it difficult to match individual pixels across views in a principled 
way, especially when the pixels span irradiance discontinuities (Figure 4c). Typical 
approaches use a threshold for pixel matching that is large enough to account for 
variations in the appearance of such “corrupted” pixels [8,12]. Unfortunately, these 
variations depend on the radiance properties of the individual scene, they do not conform 
to simple models of image noise (e.g., Gaussian), and cannot be bounded for any finite 
image resolution. Unlike existing techniques, approximate N-view stereo puts forth
an alternative approach: rather than attempting to model the appearance of corrupted
pixels to match them across frames [3], we simply ignore these pixels altogether by 
recovering r-consistent shapes from the input views. Intuitively, this works because 
(1) a view of an r-consistent shape must agree with its corresponding discrete input 
image only up to a shuffle transformation, and (2) shuffle transformations are powerful 
enough to model the elimination of corrupted pixels from an input view by “pushing” 

^ It is possible to analyze in a similar way shape recovery when the scene moves between
snapshots; reconstruction then involves computing r-consistent supersets of both the original
and the new scene volume.
all discontinuities to pixel boundaries (Figures 1b and 4d). This behavior can be thought
of as a generalization of the neighborhood-based method of Birchfield and Tomasi [28]
where pixel (dis)similarities are evaluated by comparing sets of intensities within their
neighborhoods along an epipolar line. From a technical point of view, it is possible 
to establish a resolution criterion that is a sufficient condition for reconstructing r- 
consistent shapes from discrete images. This criterion formalizes the conditions under 
which the reasoning of Figure 4d is valid; it is omitted here due to lack of space. 

Coarse-to-Fine Reconstruction. While the Approximate Space Carving Algorithm in Section
5 can be thought of as operating in a continuous domain, it can be easily converted into 
an algorithm that recovers volumetric scene models in a coarse-to-fine fashion [17]. 
The algorithm works by imposing a coarse discretization on the initial volume V and 
iteratively carving voxels away from it. The key question one must address is how to 
decide whether or not to carve a “coarse voxel” away. Figures 4c,d suggest that using 
lower-resolution images to achieve this [12,13] will only aggravate the correspondence-finding
process, and may lead to wrong reconstructions (i.e., coarse volumes that do not
contain the scene). Instead, we apply the Free-Space Query Theorem: to decide whether 
or not to carve away a voxel v, we pick a dispersion radius r that is specific for that
voxel and is large enough to guarantee that volume $V_p$ contains v in its interior. Hence,
carving coarse models simply requires establishing the color equivalence of disks in 
each image that are of the appropriate size. Moreover, the Nesting Theorem tells us 
that this approach not only guarantees that coarse reconstructions contain the scene; it 
also ensures that they can be used to derive higher-resolution ones by simply removing 
high-resolution voxels. 



7 Experimental Results 



To demonstrate the applicability of our approach, we performed several experiments 
with real image sequences. In all examples, we used a coarse-to-fine volumetric imple- 
mentation of the Approximate Space Carving Algorithm as outlined in Section 6. No 
background subtraction was performed on any of the input sequences. We relied on the 
algorithm in the Appendix to test the color equivalence of disks across views. Voxels 
were assigned a color whose RGB values led to component success in this test. 

Coarse-to-fine reconstruction: We first ran the Approximate Space Carving Algorithm
on 16 images of a gargoyle sculpture, using a small threshold of 4% for the color
equivalence test. This threshold, along with the voxel size and the 3D coordinates of a
bounding box containing the object, were the only parameters given as input to our implementation.
The sub-pixel calibration error in this sequence enabled a high-resolution
reconstruction that was carved out of a $256^3$ voxel volume. The maximum dispersion
radius was approximately two pixels. Figure 5 also shows coarse reconstructions of the
same scene carved out of $128^3$, $64^3$, and $32^3$ voxel volumes. The only parameter changed
to obtain these volumes was the voxel size. Note that the reconstructions preserve the
object’s overall shape and appearance despite the large dispersion radii associated with
them (over 40 pixels for the $32^3$ reconstruction).

Invariance under bounded image distortions: To test the algorithm’s ability to reconstruct
3D shapes from “approximate” images, we ran the approximate space carving
algorithm on artificially-modified versions of the gargoyle sequence. In the first such
experiment, we shifted each input image by a random 2D translation of maximum
length d along each axis. These modifications result in a maximum dispersion error
of $2\sqrt{2}d$ for corresponding pixels. The modified images can be thought of either (1)
as erroneously-calibrated input images, whose viewpoint contains a translation error
parallel to the image plane, or (2) as snapshots of a moving object taken at different
time instants. Figure 5 shows a $128^3$ reconstruction obtained for d = 3 in which the
approximate space carving algorithm was applied with exactly the same parameters as
those used for the original images. Despite the large dispersion errors, the recovered
shape is almost identical to that of the “error-free” case. This is precisely as predicted by our
theory: since the dispersion radius associated with each voxel is larger than 8.5 pixels,
the recovered r-consistent shape is guaranteed to be invariant to such errors. A $64^3$
reconstruction for d = 10 is also shown in the figure, corresponding to dispersion errors
of over 22 pixels. In a second experiment, we individually shifted every pixel within
a 21 × 21-pixel window, for every image of the gargoyle sequence (Figure 1d shows
one image from the modified sequence). Figure 5 shows a $64^3$ reconstruction from
this sequence that was obtained by running the approximate space carving algorithm
with exactly the same parameters as those used to obtain the $64^3$ reconstruction for
the original sequence. These results suggest that our approach can handle very large
errors and image distortions without requiring any assumptions about the scene’s shape,
its appearance, or the input viewpoints, and without having to control any additional
parameters to achieve this.
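
The bound of $2\sqrt{2}d$ quoted for the first experiment can be checked with a short derivation. The symbols $t_i$ and $t_j$ below are introduced here only for illustration: they denote the translations applied to two of the input images, each with components bounded by d in absolute value.

```latex
\[
\|t_i\| \le \sqrt{d^2 + d^2} = \sqrt{2}\,d ,
\qquad
\|t_i - t_j\| \le \|t_i\| + \|t_j\| \le 2\sqrt{2}\,d .
\]
\[
d = 3:\quad 2\sqrt{2}\cdot 3 \approx 8.49 \text{ pixels.}
\]
```

The second inequality is the triangle inequality applied to the relative displacement of corresponding pixels in the two images; the value for d = 3 agrees with the figure of roughly 8.5 pixels used in the text.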

Reconstruction from images with mixed pixels: In order to test our algorithm’s 
performance on sequences where many of the image pixels span color and intensity 
discontinuities, we ran the algorithm on 30 calibrated images of four cactus plants. The 
complex geometry of the cacti creates a difficult stereo problem in which the frequent 
discontinuities and discretization effects become significant. Our results show that the 
dispersion radius implied by a $128^3$ volume results in a good reconstruction of the scene
using a low 7% RGB component error. Despite the presence of a significant number 
of “mixed” pixels in the input data, which make stereo correspondences difficult to 
establish, the reconstruction does not show evidence of “over-carving” (Figure 5). 

Reconstruction from images with calibration errors: To test invariance against calibration
errors, we ran the approximate space carving algorithm on 45 images of an
African violet. While calibration information for each of these views was available, it 
was not extremely accurate, resulting in correspondence errors of between 1 and 3 pixels 
in the input views.^ Figure 6 shows the 3D shape obtained by
reconstructing a $128^3$ volume from the input views. This result suggests that the dispersion
radius implied by the $128^3$ reconstruction was sufficient to handle inaccuracies
in the calibration of each view. An important future direction for our work will be to 
examine how to use this approximately-reconstructed shape for self-calibration, i.e., for 
further improving the calibration information of each view and, hence, increasing the 
maximum attainable resolution of the recovered shape. 

Comparison to standard space carving: In a final set of experiments, we compared 
our approach to a volumetric reconstruction algorithm that does not incorporate an 
error-tolerance model. To achieve this, we applied an implementation of the space carv- 
ing algorithm [6] to the same sequences of images used in the above experiments, 
with exactly the same parameters. To obtain a coarse scene reconstruction of the gar- 
goyle, we ran the space carving algorithm on a volumetric grid of the desired resolution 
(i.e., a cube of $128^3$ voxels for a $128^3$ reconstruction). In all of our coarse reconstruction
runs at $128^3$, $64^3$, and $32^3$ resolutions, the space carving algorithm completely failed
to reconstruct the scene; instead, the entire set of voxels in the starting volume was 
carved away resulting in empty reconstructions. Intuitively, since larger voxels project 
to larger regions in the input views, their projection spans pixels with significant inten- 
sity variation, invalidating the voxel consistency criterion employed by that algorithm. 
Our approximate reconstruction technique, on the other hand, not only produces valid 
coarse reconstructions, but does so even when the input views are distorted significantly. 

We next ran the standard space carving algorithm on the cactus sequence. In this 
case, even a reconstruction of $256^3$, where every voxel projects to approximately one
pixel, led to over-carving in some parts of the scene (Figure 7). Attempts to reconstruct 
lower-resolution volumes with the algorithm led to almost complete carving of the 
input volume. A similar behavior was exhibited with the violet sequence. In this case, 
calibration errors in the input views led to significant over-carving even for the highest-possible
volume resolution of $256^3$, where every voxel projects to about one pixel. This
suggests that our approximate reconstruction algorithm does not suffer from the original 
algorithm’s sensitivity to even relatively small calibration errors. 

8 Concluding Remarks 

Despite our encouraging results, several important questions still remain unanswered. 
These include (1) how to determine the minimum dispersion radius that leads to 
valid scene reconstructions, (2) how to apply our analysis to the problem of joint 
reconstruction/self-calibration from N views, and (3) how to develop adaptive, multi- 
resolution reconstruction methods in which different parts of the same scene are ap- 
proximated to different degrees. 

While invariance under shuffle transformations leads to increased robustness against 
calibration errors and discretization effects, we believe that shuffle-invariant reconstruc- 
tion is only a first step toward a general theory of approximate shape recovery. This is 

^ This error range was established a posteriori from images of a test pattern that was viewed 
from approximately the same camera positions but that was not used for calibration. 
because calibration errors and image distortions are “structured” to some extent, and
rarely correspond to truly arbitrary bounded pixel repositionings. As such, our current 
investigation can be thought of as treating a worst-case formulation of the approximate 
N-view stereo problem. The question of how to limit the class of acceptable image dis- 
tortions without compromising robustness in the shape recovery process is a key aspect 
of our ongoing research. 

Appendix: Implementation of the Color Equivalence Test. Step 2b of the Approximate
Space Carving Algorithm requires determining whether two or more sets of n
pixels share a common color. This requires $O(n^2)$ pixel comparisons for two n-pixel
disks (i.e., examining every possible pixel pair $(q_1, q_2)$ with $q_1 \in D_1$ and $q_2 \in D_2$). To
improve efficiency, we use the following algorithm, which requires only $O(kn \log n)$
pixel comparisons for k disks:

Step 0 Return success if and only if Steps 1-4 return component success for every
color component $C \in \{R, G, B\}$. In this case, assign color $(\mu_R, \mu_G, \mu_B)$ to the
associated voxel.

Step 1 Let $R_i$, $i = 1, \ldots, k$, be the array of values for component C in the pixels of $D_i$.

Step 2 Sort $R_1$ and repeat for each disk $D_i$, $i = 2, \ldots, k$:

a. sort $R_i$;

b. for every value $R_1[j]$, find its closest value, $R_i^j$, in $R_i$.

Step 3 For every value $R_1[j]$, compute the standard deviation, $\sigma_j$, of the values in the
set $\{R_1[j], R_2^j, \ldots, R_k^j\}$.

Step 4 If $\min_j \sigma_j < \tau$, where $\tau$ is a threshold, return the mean, $\mu_C$, of the component
values along with component success; otherwise, return failure.

This algorithm is equivalent to the Color Equivalence Test when $D_1$ degenerates to a
pixel and weakens with increasing disk size. The threshold $\tau$ is chosen by assuming
Gaussian noise of known standard deviation for every pixel component. Specifically,
the sample variance, $\sigma_j^2$, follows a $\chi^2$ distribution with $k - 1$ degrees of freedom in
the case of component success. The threshold can therefore be chosen to ensure that
component success is returned with high probability when the disks share a common
value for component C [29].
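
The steps of the Appendix can be sketched in Python as follows. This is one possible reading, not the code used for the experiments: the function names are hypothetical, the threshold is passed in as the parameter `tau` rather than derived from a noise model, at least two non-empty disks are assumed, and the returned mean is assumed to be that of the best-matching value set (the text does not say which set the mean is taken over).

```python
import bisect
import statistics

def closest(sorted_vals, x):
    """Value in a non-empty sorted list that is nearest to x (binary search)."""
    i = bisect.bisect_left(sorted_vals, x)
    candidates = sorted_vals[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda v: abs(v - x))

def component_test(disks, tau):
    """Steps 1-4 for one color component.

    disks -- list of k >= 2 non-empty lists of component values, one per disk.
    Returns (True, mean) on component success, (False, None) otherwise.
    """
    ref = sorted(disks[0])                       # Step 2: sort R_1
    others = [sorted(d) for d in disks[1:]]      # Step 2a: sort R_i
    best = None
    for x in ref:
        matched = [x] + [closest(o, x) for o in others]   # Step 2b
        sigma = statistics.stdev(matched)                 # Step 3 (sample std)
        if best is None or sigma < best[0]:
            best = (sigma, statistics.fmean(matched))
    if best is not None and best[0] < tau:                # Step 4
        return True, best[1]
    return False, None

def color_equivalence_test(disks_rgb, tau):
    """Step 0: succeed only if every component succeeds.

    disks_rgb -- list of k lists of (R, G, B) tuples.
    Returns the voxel color as a tuple of component means, or None on failure.
    """
    means = []
    for c in range(3):
        ok, mu = component_test([[p[c] for p in d] for d in disks_rgb], tau)
        if not ok:
            return None
        means.append(mu)
    return tuple(means)
```

Because each component is matched independently, the three component means may come from different pixels of a disk; this is one way to see why the test weakens as the disks grow.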
Figure 5. Experimental results. Also shown are the number of surface voxels, V, as well as the
computation time, T, in seconds.
Figure 6. Reconstruction in the presence of calibration errors. Four out of 46 images of the
sequence are shown. A view of the reconstructed 3D shape is shown on the right, obtained after 
full convergence of the approximate space carving algorithm. 




[Figure 7 panel labels: $256^3$, std. space carving; $128^3$, std. space carving; $256^3$, std. space carving]

Figure 7. Comparing reconstructions computed with the standard and approximate space carving
algorithm. Left column: Views of the reconstructed cacti. Over-carving is noticeable in the
upper-right part of the recovered volume. Right columns, top row: More views of the reconstruction
shown in Figure 6. Right columns, bottom row: Volume generated by standard space carving
before the algorithm reached complete convergence. Note that even at this stage, significant
over-carving is evident throughout the volume; the final reconstruction contains even fewer voxels,
because over-carved regions cause errors in the algorithm’s visibility computations.

Abstract. In this paper we derive a minimal set of sufficient constraints
for 27 numbers to constitute a trifocal tensor. It is shown that,
in general, eight nonlinear algebraic constraints are enough. This result
agrees with the theoretically expected number of eight independent
constraints and is novel, since the sets of sufficient constraints known
to date contain at least 12 conditions. Up to now, research on and
formulation of constraints for the trifocal tensor has concentrated mainly on
the correlation slices and has produced sets of constraints that are neither
minimal ($\geq 12$) nor independent. We show that by turning attention
from correlation to homographic slices, simple geometric considerations
yield the desired result. Having the minimal set of constraints is important
for constrained estimation of the tensor, as well as for deepening
the understanding of the multiple view relations that are valid in the
projective framework.



1 Introduction 

The last decade has seen a great number of publications in the area of multiple
view vision and consequently enormous progress in practical as well as
theoretical aspects. One of the most interesting and intriguing theoretical
constructions is the trifocal tensor, which appears as the connecting block between the
homogeneous coordinates of image points and/or image lines over three views.
It is already very difficult even to try to cite all the scientific work that has been
published on this topic in recent years. The trifocal tensor first appeared
in disguised form in [?] in the calibrated case and in [?]. Trilinear
relationships in the uncalibrated case were first published in [?] and the tensorial
description in [?], [?] and [?]. The simultaneous use of corresponding points and
lines appeared in [?], and necessary and sufficient conditions were formulated
in [?] and [?].

In this paper we will focus attention on deriving a minimal set of constraints.
It is seen that eight nonlinear constraints are enough. This is a very satisfying
result because the theoretically expected number has been eight and the minimal
set known to date contained 12 constraints (not independent).

In order to make the material as self-contained as possible we give in Section
2 a derivation of the trifocal tensor. Overlap with the existing literature is
unavoidable; however, many things appear in new form. In Section 3 we describe the
so-called correlation slices of the tensor that have been the basis for
developing sufficient constraints. In Section 4 we turn attention from correlation
to homographic slices ([?], [?], [?]) and show that this is the key for obtaining a
minimal set of constraints. In Section 5 we formulate our geometric and algebraic
conditions and we conclude in Section 6.



2 The Trifocal Tensor 

In this section we will give a derivation of the trifocal tensor in closed factored 
form in dependence on camera matrices and at the same time fix our notation. 
We note here already that we are not going to use tensor notation and that the 
indices I, J and K that will appear are not tensorial indices but just a means to 
distinguish the views. 

2.1 One View 

Consider a scene point in the three-dimensional projective space $\mathcal{P}^3$ represented
by a four-dimensional vector $X$ containing the homogeneous coordinates of the
point. A projective camera $I$ represented by a $3 \times 4$ matrix $P_I$ will map the space
point onto a point $x_I \in \mathcal{P}^2$ of the image plane $I$ containing its three homogeneous
coordinates. This mapping is linear and reads

$$x_I \sim P_I X. \tag{1}$$

Since the scale is unimportant and unrecoverable when using homogeneous coordinates,
we use the sign $\sim$ instead of $=$. We will be using the $=$ sign only if
strict numerical equality between the two sides is meant. If our knowledge about
the location of a space point is limited to the knowledge of its image $x_I$,
we can only say about $X$ that it must lie on the optical ray $B_I(x_I)$ containing
the image point $x_I$ and the camera center $C_I \in \mathcal{P}^3$. The camera center fulfills
the equation $P_I C_I = 0$ since it is the only point in $\mathcal{P}^3$ for which there is no
image (of course under the idealizing assumption that the camera "sees" in all
directions). We obtain a parametric representation of the possible locations of
$X$ if we try to "solve" the equation above. Since $P_I$ is not square it does not
have an inverse. However, using the pseudoinverse [?] $P_I^+$ of $P_I$ we obtain

$$X \sim P_I^+ x_I + C_I \lambda$$

and every point $X$ computed that way satisfies equation (1). We will always
assume that camera matrices have rank three. In that case we have
$P_I^+ = P_I^T (P_I P_I^T)^{-1}$ and $C_I \sim (P_I^+ P_I - I_4) v$, where $I_4$ is the $4 \times 4$ identity and $v$
may be any arbitrary 4-vector (not in the range of $P_I^T$), since $(P_I^+ P_I - I_4)$ has
rank 1. All possible different solutions $X$ are obtained by varying the parameter
$\lambda$. We note that $P_I^+ x_I$ is a point on the ray $B_I(x_I)$ that is guaranteed not to be
coincident with $C_I$. That enables us to write down the nonparametric equation
of the ray $B_I(x_I)$ based on the two distinct points $P_I^+ x_I$ and $C_I$ that define the
ray:

$$B_I(x_I) \sim P_I^+ x_I C_I^T - C_I x_I^T P_I^{+T}$$

$B_I(x_I)$ is a $4 \times 4$ skew-symmetric matrix of rank two that contains the Plücker
coordinates of the ray [?]. This representation has the advantage that one can
immediately give the intersection point of the ray with some given plane $\pi$.
Thus, if we know that the space point $X$ is on the plane $\pi$ ($X^T \pi = 0$) then we
will have:

$$X \sim B_I(x_I)\, \pi$$



2.2 Two Views 

Next we consider a second image $x_J \sim P_J X$ of $X$ provided by a second camera
$J$ with camera matrix $P_J$. For the point considered above we then obtain the
relation $x_J \sim P_J B_I(x_I) \pi \sim P_J (P_I^+ x_I C_I^T - C_I x_I^T P_I^{+T}) \pi$. As is easily seen,
the dependence of $x_J$ on $x_I$ is linear. To make this linearity explicit we use
twice the algebraic law $\mathrm{vec}(ABC) = (A \otimes C^T)\,\mathrm{vec}(B)$, where vec denotes
the row-wise conversion of a matrix to a column vector and $\otimes$ denotes the
Kronecker product, and obtain

$$x_J \sim H_{JI}(\pi)\, x_I \tag{2}$$

where $H_{JI}(\pi)$ is given by the following formula:

$$H_{JI}(\pi) \sim (P_J \otimes \pi^T)(I_4 \otimes C_I - C_I \otimes I_4) P_I^+ \tag{3}$$

$H_{JI}(\pi)$ is a homography matrix that assigns to every point $x_I$ of image plane
$I$ an image point $x_J$ of image plane $J$ due to the plane $\pi$. Put differently, by a
homography mapping $x_J \sim H_{JI}(\pi) x_I$ the two optical rays $B_I(x_I)$ and $B_J(x_J)$
are guaranteed to intersect on the plane $\pi$. Multiplying out we also obtain for
the above homography:

$$H_{JI}(\pi) \sim P_J P_I^+ \otimes \pi^T C_I - P_J C_I \pi^T P_I^+ \sim H_{JI} \otimes \pi^T C_I - e_{JI}\, \pi^T P_I^+ \tag{4}$$

Here we have denoted by $e_{JI} := P_J C_I$ the epipole of camera center $C_I$ on image
plane $J$ and by $H_{JI}$ (without argument) the product $P_J P_I^+$, which is a special
homography due to the plane with the same homogeneous coordinates as $C_I$
($\pi \sim C_I$), as is easily seen from above. This is a plane that is guaranteed not to
contain the camera center $C_I$. Now multiplying equation (2) on the left by the
$3 \times 3$ rank-two skew-symmetric matrix $[e_{JI}]_\times$ that computes the cross product
we obtain $[e_{JI}]_\times x_J \sim [e_{JI}]_\times H_{JI}(\pi) x_I$. The product $[e_{JI}]_\times H_{JI}(\pi)$, referred to
as the fundamental matrix $F_{JI}$,

$$F_{JI} \sim [e_{JI}]_\times H_{JI}(\pi), \tag{5}$$

is easily seen from equation (4) not to depend on the plane $\pi$ (at least up to an
unimportant scale factor), because $[e_{JI}]_\times e_{JI} = 0$. Furthermore, it is invariant with respect to projective
transformations of the space $\mathcal{P}^3$. As such, it is well known that the fundamental
matrix encapsulates the projective structure of two views and computes for
every point $x_I$ on image plane $I$ the epipolar line, denoted $\lambda_{JI}$, in image plane $J$
on which $x_J$ is constrained to lie:

$$[e_{JI}]_\times x_J \sim F_{JI}\, x_I \tag{6}$$

No less well known is the fact that the fundamental matrix depends on 7
independent parameters ($2 \times 11$ for the two camera matrices minus 15 for the
projective freedom) and that consequently the nine elements of $F_{JI}$ are not
independent and should fulfil one constraint besides the free scale. As is
apparent from equation (5), this constraint must be $\det(F_{JI}) = 0$ since $[e_{JI}]_\times$ and
consequently also $F_{JI}$ has rank two. What has been so easy to derive in the case
of two cameras is by far not as trivial in the case of three cameras, described
below.

Concluding this short two-view exposition we also mention some well-known
but important facts concerning homographies and fundamental matrices that
will be useful in the sequel:

— Homographies are in general regular matrices.

— The inverse of a regular homography is a homography due to the same plane
but with interchanged image planes: $H_{JI}(\pi)^{-1} \sim H_{IJ}(\pi)$.

— Any homography $H_{JI}(\pi)$ maps the epipole $e_{IJ}$ onto the epipole $e_{JI}$.

— The map of a point $x_I$ under a homography $H_{JI}(\pi)$ is on the epipolar line
$\lambda_{JI}$, i.e. $\lambda_{JI}^T H_{JI}(\pi) x_I = 0 \ \forall \pi$.

— The transpose of a homography maps corresponding epipolar lines onto one
another, i.e. $H_{JI}(\pi)^T \lambda_{JI} \sim \lambda_{IJ}$.

— The right and left null spaces of the fundamental matrix $F_{JI}$ are the epipoles
$e_{IJ}$ and $e_{JI}$ respectively.

— Transposition of a fundamental matrix yields the fundamental matrix for
interchanged image planes: $F_{JI}^T \sim F_{IJ}$.
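
A few of these facts can be checked numerically with exact integer arithmetic. The sketch below is only an illustration: the epipole `e_JI` and the regular matrix `H_JI` are arbitrary made-up values rather than quantities derived from real cameras, and all helper names are hypothetical. It builds $F_{JI} = [e_{JI}]_\times H_{JI}$ as in equation (5) and exposes the quantities needed to verify the rank deficiency, the left null space, and the epipolar constraint.

```python
def cross_matrix(e):
    """Skew-symmetric matrix [e]_x such that [e]_x v equals the cross product e x v."""
    x, y, z = e
    return [[0, -z, y], [z, 0, -x], [-y, x, 0]]


def matmul(A, B):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(A, v):
    """Product of a 3x3 matrix with a 3-vector."""
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))


# Hypothetical example data (not taken from the paper).
e_JI = [1, 2, 3]                              # epipole in image J
H_JI = [[2, 0, 1], [1, 3, 0], [0, 1, 4]]      # a regular homography, det = 25
F_JI = matmul(cross_matrix(e_JI), H_JI)       # equation (5)

x_I = [1, -1, 2]                              # some point in image I
x_J = matvec(H_JI, x_I)                       # its map under the homography
epipolar_line = matvec(F_JI, x_I)             # epipolar line F_JI x_I in image J

print(det3(F_JI))                             # vanishes because [e]_x has rank two
print(dot(x_J, epipolar_line))                # incidence of x_J with its epipolar line
```

Both printed values vanish by construction: $\det([e]_\times) = 0$, and $(Hx)^T [e]_\times (Hx) = 0$ for any skew-symmetric matrix.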

2.3 Three Views 

Now we assume that the knowledge about the plane $\pi$ on which the space point
$X$ has been assumed to lie is provided by a third view $K$. If the image point
$x_K \sim P_K X$ is known to lie on the line $l_K$ of the image plane $K$ then one
deduces from $l_K^T x_K = 0$ that $l_K^T P_K X = 0$ and consequently that this plane
is described by $\pi \sim P_K^T l_K$. Now plugging this specific plane into equation (3)
and using elementary operations concerning Kronecker products
we obtain

$$H_{JI}(P_K^T l_K) \sim (I_3 \otimes l_K^T)(P_J \otimes P_K)(I_4 \otimes C_I - C_I \otimes I_4) P_I^+ . \tag{7}$$

We define the part of the expression on the right that depends only on
camera parameters to be the trifocal tensor $T_I^{JK}$. Multiplying out we also obtain
the tensor in a more familiar form:

$$T_I^{JK} \sim (P_J \otimes P_K)(I_4 \otimes C_I - C_I \otimes I_4) P_I^+ \tag{8}$$

$$\sim H_{JI} \otimes e_{KI} - e_{JI} \otimes H_{KI} \tag{9}$$

Like the fundamental matrix, the trifocal tensor is invariant with respect to
projective transformations of $\mathcal{P}^3$ and encapsulates the projective structure of the space
given three (uncalibrated) views. Besides, it has been shown that one can have
in (9) homographies due to any plane, which must however be the same for both
homographies, i.e. we have $T_I^{JK} \sim H_{JI}(\pi) \otimes e_{KI} - e_{JI} \otimes H_{KI}(\pi) \ \forall \pi$.

Written in the form above the tensor appears as a $9 \times 3$ matrix. We will adopt
in this paper a sort of engineering approach and treat the tensor as a matrix.
We believe that in doing so nothing is really lost, while some
properties of the tensor are in our opinion easier to grasp this way, avoiding
confusing tensorial indices. After all, the fundamental matrix is a tensor as well
but is mostly referred to as a matrix. Note however that there are cases where the
converse procedure may well be reasonable, i.e. treating the fundamental matrix
as a tensor (cf. [2]). Particular contractions of the tensor that will be needed
later will appear as $3 \times 3$ matrices that can be extracted from the tensor either
as particular submatrices and linear combinations thereof or as $3 \times 3$ reshapings
of its three columns and linear combinations thereof. To be specific, we will be
using the following notation:



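$$T_I^{JK} = (q, r, s) = \begin{pmatrix} U \\ V \\ W \end{pmatrix} \tag{10}$$

Here $q$, $r$ and $s$ are 9-dimensional vectors and $U$, $V$ and $W$ are $3 \times 3$ matrices. The
$3 \times 3$ reshapings of the three 9-vectors $q$, $r$ and $s$ will be denoted $Q$, $R$ and $S$
respectively, i.e. we define

$$\{q\}_{3\times3} =: Q, \quad \{r\}_{3\times3} =: R \quad \text{and} \quad \{s\}_{3\times3} =: S$$

with

$$\mathrm{vec}(Q) = q, \quad \mathrm{vec}(R) = r \quad \text{and} \quad \mathrm{vec}(S) = s .$$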
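
The indexing behind this notation is easy to get wrong, so the following sketch spells it out for a $9 \times 3$ array stored as nested Python lists. It assumes the row-wise vec convention stated in Section 2.2; the helper names are hypothetical, and the 27 example entries are arbitrary integers chosen to make the layout visible (they do not form a valid trifocal tensor).

```python
def columns(T):
    """The three 9-vectors q, r, s, i.e. the columns of the 9x3 matrix T."""
    return [[row[i] for row in T] for i in range(3)]


def reshape3(v):
    """Row-wise 3x3 reshaping {v}_{3x3} of a 9-vector (inverse of row-wise vec)."""
    return [v[0:3], v[3:6], v[6:9]]


def correlation_slices(T):
    """Q, R, S: the 3x3 reshapings of the columns q, r, s."""
    q, r, s = columns(T)
    return reshape3(q), reshape3(r), reshape3(s)


def homographic_slices(T):
    """U, V, W: the three stacked 3x3 row blocks of T."""
    return T[0:3], T[3:6], T[6:9]


def contract(T, x):
    """The 3x3 matrix {T x}_{3x3} for a 3-vector x."""
    Tx = [sum(row[i] * x[i] for i in range(3)) for row in T]
    return reshape3(Tx)


# Arbitrary layout example: entry (r, c) of T holds the number 3*r + c.
T = [[3 * r + c for c in range(3)] for r in range(9)]
Q, R, S = correlation_slices(T)
U, V, W = homographic_slices(T)
x = [1, 2, 3]
M = contract(T, x)
```

Under this convention $\{T x\}_{3\times3}$ equals $x^1 Q + x^2 R + x^3 S$, and its rows are $(Ux)^T$, $(Vx)^T$ and $(Wx)^T$, which is the identity used repeatedly in the later sections.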

To ease the further reading of the paper for readers accustomed to the tensorial
notation we note that $Q$, $R$ and $S$ are the correlation slices of the tensor, often
denoted $T_1^{\cdot\cdot}$, $T_2^{\cdot\cdot}$ and $T_3^{\cdot\cdot}$ respectively. These are the three classical
matrices of [?] and [?] discovered in the context of line geometry. Likewise,
$U$, $V$ and $W$ are the three homographic slices $T_\cdot^{1\cdot}$, $T_\cdot^{2\cdot}$ and $T_\cdot^{3\cdot}$
respectively, described in [?], [?] and [?].

Perhaps one of the most striking properties of the trifocal tensor is that this
one entity links together corresponding points and/or lines in the three views
as well as corresponding points and lines through corresponding points. It is
immediately seen from equations (2), (7) and (8) that the corresponding image
points $x_I$, $x_J$ and the image line $l_K$ that goes through the third corresponding
image point $x_K$ are linked together by the relation

$$x_J \sim (I_3 \otimes l_K^T)\, T_I^{JK} x_I \tag{11}$$

or, equivalently,

$$x_J \sim \{T_I^{JK} x_I\}_{3\times3}\, l_K . \tag{12}$$

Now, with any line $l_J$ going through image point $x_J$ we will also have

$$l_J^T \{T_I^{JK} x_I\}_{3\times3}\, l_K = 0 \tag{13}$$

or, equivalently,

$$(l_J^T \otimes l_K^T)\, T_I^{JK} x_I = 0 .$$

If we would like to interchange the roles of image planes $J$ and $K$ we would have to
interchange the indices $J$ and $K$ everywhere in the formulas above, thus moving
also to a different tensor. However, it is easily seen that these two tensors do not
differ substantially from one another. In fact, the one is a simple rearrangement
of the elements of the other. This can be easily deduced from equation (13),
whose very symmetrical structure may lead either back to equation (12) or
to

$$l_J^T \{T_I^{JK} x_I\}_{3\times3} \sim x_K^T \tag{14}$$

thus computing the corresponding point in image plane $K$ from the image point
$x_I$ and some image line $l_J$ going through $x_J$ using the same tensor.



3 The Correlation Slices Q, R and S 

We begin by describing the so-called correlation slices $Q$, $R$ and $S$ of the tensor
since these entities have been mainly used in exploring the constraints that 27
numbers, arranged as in (10), must satisfy in order for them to constitute a
trifocal tensor. Starting with equation (12) and using (10) we obtain

$$x_J \sim (x_I^1 Q + x_I^2 R + x_I^3 S)\, l_K .$$

Thus, any linear combination of the matrices $Q$, $R$ and $S$ maps lines $l_K$ in image
plane $K$ onto points in image plane $J$. Since mappings from lines to points are
called correlations in the mathematical literature [?], we see that linear
combinations of $Q$, $R$ and $S$ are correlation contractions of the tensor. Moreover, we
observe that by fixing $x_I$ and varying $\pi$ or, equivalently, varying $l_K$ arbitrarily,
$x_J$ must move along the epipolar line $F_{JI} x_I \sim \lambda_{JI}$ in image $J$ since $x_I$ remains
fixed. All these correlations and consequently $Q$, $R$ and $S$ themselves, which
have been called the Standard Correlation Slicing [3], must therefore be rank-two,
singular correlations with left null space the epipolar line $\lambda_{JI}$ corresponding
to the image point $x_I$. Furthermore, since all epipolar lines intersect at the
epipole $e_{JI}$, the left null spaces of all correlations $\{T_I^{JK} x_I\}_{3\times3}$ span only a
two-dimensional space. This is especially true for the Standard Correlation Slicing
$Q$, $R$ and $S$, for which according to (6) the left null spaces are respectively the
first, second and third column of the fundamental matrix $F_{JI}$, and fundamental
matrices are rank two.

As for the right null space of $\{T_I^{JK} x_I\}_{3\times3}$, by repeating the argument above
but starting from eq. (14) we find that it must be the epipolar line $\lambda_{KI} \sim F_{KI} x_I$
in image $K$ corresponding to the image point $x_I$. Again, since all those
epipolar lines intersect at the epipole $e_{KI}$, the right null spaces of all correlation
contractions $\{T_I^{JK} x_I\}_{3\times3}$ span a two-dimensional space as well. In particular,
the right null spaces of $Q$, $R$ and $S$ are the first, second and third column of the
fundamental matrix $F_{KI}$ respectively.

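It is perhaps very instructive to note that the whole discussion above may be
condensed into two simple algebraic formulas ((15) and (16) below) which can tell
us even more than we could deduce geometrically. If we rearrange the elements
of $T_I^{JK}$ in another $9 \times 3$ matrix, defining

$$\hat{T} := \begin{pmatrix} Q \\ R \\ S \end{pmatrix},$$

then it is not difficult to see from (9) that $\hat{T}$ will read as follows:

$$\hat{T} = \begin{pmatrix} Q \\ R \\ S \end{pmatrix} \sim \mathrm{vec}(H_{JI}^T)\, e_{KI}^T - H_{KI}^T \otimes e_{JI} .$$

We observe that we can cancel the first term above by multiplying on the right
by $F_{KI}$, obtaining

$$\hat{T} F_{KI} = \begin{pmatrix} Q \\ R \\ S \end{pmatrix} F_{KI} \sim H_{KI}^T F_{KI} \otimes e_{JI},$$

and since $H_{KI}^T F_{KI} \sim (F_{IK} H_{KI})^T \sim [e_{IK}]_\times$ we eventually get

$$\begin{pmatrix} Q \\ R \\ S \end{pmatrix} F_{KI} \sim [e_{IK}]_\times \otimes e_{JI} = \begin{pmatrix} 0 & -e_{IK}^3 e_{JI} & e_{IK}^2 e_{JI} \\ e_{IK}^3 e_{JI} & 0 & -e_{IK}^1 e_{JI} \\ -e_{IK}^2 e_{JI} & e_{IK}^1 e_{JI} & 0 \end{pmatrix} . \tag{15}$$

Hence, we first verify the result that the columns of $F_{KI}$ are the right null
spaces of $Q$, $R$ and $S$ (diagonal entries in (15)). Furthermore, we also obtain a
generalized eigenvector of pairs of standard correlations and the corresponding
eigenvalues, as is easily seen. This is the line of thought we want to pursue,
however not for correlations but for the homographic slices of the trifocal
tensor, to be investigated later.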

Similarly, by defining

$$\tilde{T} := (Q, R, S) \sim H_{JI} \otimes e_{KI}^T - e_{JI}\, (\mathrm{vec}(H_{KI}^T))^T$$

and multiplying on the left by $F_{IJ}$ we cancel the second term and obtain

$$F_{IJ}\,(Q, R, S) \sim F_{IJ} H_{JI} \otimes e_{KI}^T \sim [e_{IJ}]_\times \otimes e_{KI}^T \tag{16}$$

which is the analogue of equation (15). Equations (15) and (16) are samples
of Heyden's quadratic p-relations [?] but put in a much more informative and
appealing form. We note here in passing that all other quadratic p-relations
that are valid between the trifocal tensor, the fundamental matrices and the
epipoles may be obtained in a similar manner by cancelling one or both terms of
different systematic rearrangements of the tensor elements. Furthermore, tasks
like the computation of all trifocal tensors from some particular one or from the
fundamental matrices, and vice versa, can be very compactly formulated using
the formalism introduced above.

The preceding discussion shows that the correlation slices of the tensor
exhibit many striking properties and it is natural that they have attracted much
interest in the past. In fact, the constraints that 27 numbers must satisfy in
order for them to constitute a trifocal tensor have been formulated on the basis
of the correlation slices. Reversing the discussion above, it has been proved in [?]
(cf. also [?]) that all properties of the correlations we mentioned also make up
sufficient conditions, and we adapt here a theorem in [?] in the following form:

Theorem 1 (Papadopoulo-Faugeras). If 27 numbers are arranged as in
equation (10) then they constitute a trifocal tensor if

— all linear combinations of $Q$, $R$ and $S$ are singular and

— all right null spaces of $Q$, $R$ and $S$ span only a two-dimensional subspace
and

— all left null spaces of $Q$, $R$ and $S$ span only a two-dimensional subspace.

Hence, the conditions given above are necessary, as shown by our discussion and
as was well known, and sufficient, as shown by Theorem 1. However, turning these
conditions into algebraic constraints that should be fulfilled by the 27 numbers
in order for them to constitute a trifocal tensor has resulted in 12 constraints
[?] that are not independent, since any number of constraints greater than eight
must contain dependencies.

In the next section we will elucidate the question of constraints for the trifocal 
tensor from a new point of view that will enable us to formulate a new set of 
conditions that will be necessary and sufficient and at the same time minimal. 
Towards this end we now turn attention from the correlation slices Q, R and S 
of the tensor to the homographic slices U, V and W. 



4 The Homographic Slices U, V and W 

In order to exploit these slices for the formulation of sufficient conditions for the
tensor we first have to study their properties in some detail. To this end, we
return to equation (14) and use the vec-operator on both sides. We then obtain
$(l_J^T \otimes I_3)\, T_I^{JK} x_I \sim x_K$, which reads more explicitly

$$H_{KI}(P_J^T l_J)\, x_I \sim (l_J^1 U + l_J^2 V + l_J^3 W)\, x_I \sim x_K .$$

Thus, any linear combination of the matrices $U$, $V$ and $W$ with coefficients being
the homogeneous coordinates of the line $l_J$ yields a homography from image $I$
onto image $K$ due to the plane $P_J^T l_J$. Generally, these homographies are regular,
i.e. of rank three. As with correlations, the Standard Homography Slicing ([?], [2]) of
the trifocal tensor is here defined by setting for $l_J$ the unit vectors $(1,0,0)^T$,
$(0,1,0)^T$ and $(0,0,1)^T$, which gives us respectively the matrices $U$, $V$ and $W$
as homographies due to the planes $\pi_1 \sim ((1,0,0)P_J)^T$, $\pi_2 \sim ((0,1,0)P_J)^T$
and $\pi_3 \sim ((0,0,1)P_J)^T$. These planes are therefore represented by
the first, second and third row of the camera matrix $P_J$ respectively. Now, since
the matrices $U$, $V$ and $W$ are homographies from image plane $I$ to image plane $K$
they must possess the homography properties given in Section 2.2. In particular,
they must map the epipole $e_{IK}$ onto the epipole $e_{KI}$ and hence the epipole $e_{IK}$
is a generalized eigenvector for the three matrices $U$, $V$ and $W$:

$$U e_{IK} \sim V e_{IK} \sim W e_{IK} \sim e_{KI} . \tag{17}$$

In the Appendix we also prove algebraically the following stronger result

$$\{T_I^{JK} e_{IK}\}_{3\times3} \sim \begin{pmatrix} (U e_{IK})^T \\ (V e_{IK})^T \\ (W e_{IK})^T \end{pmatrix} \sim e_{JK}\, e_{KI}^T \tag{18}$$

that gives us in addition the generalized eigenvalues in dependence on the epipole
$e_{JK}$:

$$e_{JK}^2\, U e_{IK} = e_{JK}^1\, V e_{IK}, \quad e_{JK}^3\, V e_{IK} = e_{JK}^2\, W e_{IK}, \quad e_{JK}^1\, W e_{IK} = e_{JK}^3\, U e_{IK} .$$

Now let us consider the space line $L_{23}$ defined to be the intersection line between
planes $\pi_2$ and $\pi_3$ as defined above. Since these planes both go through $C_J$, so
does their intersection, which is therefore an optical ray of image $J$. Furthermore,
these planes intersect the image plane $J$ by construction in the lines $(0,1,0)^T$ and
$(0,0,1)^T$ which, in turn, intersect in the point $(1,0,0)^T$. Hence, we obtain for this
optical ray $L_{23} \sim B_J((1,0,0)^T)$ and consequently its images onto image planes
$I$ and $K$ are the epipolar lines $F_{IJ}(1,0,0)^T$ and $F_{KJ}(1,0,0)^T$ respectively.
Similarly, we obtain for the two other intersection lines: $L_{31} \sim B_J((0,1,0)^T)$ with
images $F_{IJ}(0,1,0)^T$ and $F_{KJ}(0,1,0)^T$ onto image planes $I$ and $K$ respectively,
and $L_{12} \sim B_J((0,0,1)^T)$ with images $F_{IJ}(0,0,1)^T$ and $F_{KJ}(0,0,1)^T$ onto image
planes $I$ and $K$ respectively. Next consider some pair of standard homographies,
$V$ and $W$ say. Since they are due to the planes $\pi_2$ and $\pi_3$ respectively, the
restriction of both homographies to the intersection $\pi_2 \cap \pi_3 \sim L_{23}$ will give the
same line homography that maps points of the epipolar line $F_{IJ}(1,0,0)^T$ onto
points of the epipolar line $F_{KJ}(1,0,0)^T$. To be specific, we will have:

If $x_I^T F_{IJ}(1,0,0)^T = 0$ then $V x_I \sim W x_I \sim x_K$ with $x_K^T F_{KJ}(1,0,0)^T = 0$.

Consequently, all points of the epipolar line $F_{IJ}(1,0,0)^T$ are generalized
eigenvectors for the pair of standard homographic slices $V$ and $W$. Hence, these points
span a two-dimensional generalized eigenspace for these two homographies. For
the sake of completeness we display the analogous result for the two other pairs
of homographies.

If $x_I^T F_{IJ}(0,1,0)^T = 0$ then $W x_I \sim U x_I \sim x_K$ with $x_K^T F_{KJ}(0,1,0)^T = 0$.

All points of the epipolar line $F_{IJ}(0,1,0)^T$ are generalized eigenvectors for the
pair of homographic slices $W$ and $U$ and span a two-dimensional generalized
eigenspace for these two homographies.

If $x_I^T F_{IJ}(0,0,1)^T = 0$ then $U x_I \sim V x_I \sim x_K$ with $x_K^T F_{KJ}(0,0,1)^T = 0$.

All points of the epipolar line $F_{IJ}(0,0,1)^T$ are generalized eigenvectors for the
pair of homographic slices $U$ and $V$ and span a two-dimensional generalized
eigenspace for these two homographies.

This situation is depicted in Fig. 1 and Fig. 2.

Fig. 1. Generalized eigenspaces of the homographies $U$, $V$ and $W$.

Fig. 2. Generalized eigenspaces of the homographies $U^{-1}$, $V^{-1}$ and $W^{-1}$.


In summary, the generalized eigenspaces of any pair out of the three
homographies $U$, $V$ and $W$ will consist of a one-dimensional eigenspace spanned by
the epipole $e_{IK}$ and of a two-dimensional eigenspace that is orthogonal to one
column of the fundamental matrix $F_{IJ}$, i.e. it represents the points of the
epipolar line in image $I$ of a point in image $J$ that, in turn, is represented by a unit
vector. In general, considering linear combinations of the homographic slices $U$,
$V$ and $W$, i.e. homographic contractions of the tensor, we see that essentially we
have already proved the following theorem:

Theorem 2. Given any two homographies $H_{KI}(P_J^T l_{J1})$ and $H_{KI}(P_J^T l_{J2})$
between image planes $I$ and $K$ that are due to the planes $P_J^T l_{J1}$ and $P_J^T l_{J2}$, defined
by the lines $l_{J1}$ and $l_{J2}$ in image plane $J$ and by the camera center $C_J$, the
generalized eigenspaces of the two homographies are

— a one-dimensional eigenspace spanned by the epipole $e_{IK}$ and

— a two-dimensional eigenspace containing all points on the epipolar line
$F_{IJ}(l_{J1} \times l_{J2})$.

For applications regarding the generalized eigenspaces and eigenvalues of pairs
of homographies cf. [?] and [?].

We note that it follows from the above that the second pair of epipoles on the image
planes $I$ and $K$, namely $e_{IJ}$ and $e_{KJ}$, is mapped by all homographies $U$, $V$ and
$W$ and linear combinations thereof onto one another as well, i.e. we have

$$U e_{IJ} \sim V e_{IJ} \sim W e_{IJ} \sim e_{KJ} . \tag{19}$$

Again, in the Appendix we prove the stronger result

$$\{T_I^{JK} e_{IJ}\}_{3\times3} \sim \begin{pmatrix} (U e_{IJ})^T \\ (V e_{IJ})^T \\ (W e_{IJ})^T \end{pmatrix} \sim e_{JI}\, e_{KJ}^T \tag{20}$$

that gives us now the following double generalized eigenvalues, since we are now
in two-dimensional eigenspaces:

$$e_{JI}^2\, U e_{IJ} = e_{JI}^1\, V e_{IJ}, \quad e_{JI}^3\, V e_{IJ} = e_{JI}^2\, W e_{IJ}, \quad e_{JI}^1\, W e_{IJ} = e_{JI}^3\, U e_{IJ} . \tag{21}$$

The role of the epipole $e_{IJ}$ is recognized as that of being the intersection of all
two-dimensional generalized eigenspaces of all pairs of linear combinations of the
standard homographies $U$, $V$ and $W$.

After having explored the properties of the homographic slices of the trifocal
tensor that give us necessary conditions, we set out in the next section to find
a minimal set of sufficient conditions.



5 The Sufficient Conditions 

5.1 Geometric Conditions 

We begin with the following theorem: 

Theorem 3. Any three homographies $U$, $V$ and $W$ between two image planes
which are due to three different planes $\pi_1$, $\pi_2$ and $\pi_3$ constitute a trifocal
tensor if arranged as in (10).

Thus, we get a trifocal tensor without any reference to a third camera or to a
third image plane $J$. To see this we simply let the intersection of the three planes
play the role of the camera center of a fictive third camera ($\pi_1 \cap \pi_2 \cap \pi_3 \sim C_J$)
and let the intersecting lines between two planes at a time play the role of the
optical rays that we have considered in Section 4.

Therefore, if 27 numbers have been arranged as in relation (10) then we only
have to ensure that the three matrices $U$, $V$ and $W$ could be interpreted as
homographies between two image planes. We claim that, for this to be the case, it
suffices that the generalized eigenspaces of pairs of matrices $U$, $V$ and $W$ should
be as shown in Fig. 1. This is essentially the content of the next theorem. In fact,
we will draw the conclusion directly from the eigenspace constellation to the
trifocal tensor. The matrices $U$, $V$ and $W$ will then of course automatically be
homographies since this is a property of the tensor.

Theorem 4. Three $3 \times 3$ matrices $U$, $V$ and $W$, arranged as in eq. (10),
constitute a trifocal tensor if and only if the generalized eigenspaces of pairs of them
are as shown in Fig. 3.



Proof: The necessity has been shown in Section 4. To prove sufficiency we make
use of Theorem 1 and of Fig. 3.




Fig. 3. The point $a$ is a one-dimensional generalized eigenspace
of all pairs among the matrices $U$, $V$ and $W$. Besides, every pair owns
a two-dimensional generalized eigenspace represented by a line $s_i$ as
shown in the figure. All three two-dimensional generalized eigenspaces
intersect in the point $b$.



Consider in Fig. 3 some arbitrary point $x$ that defines a line $l \sim a \times x$. This line
intersects the lines $s_1$, $s_2$ and $s_3$ in the points $u$, $v$ and $w$ respectively. Now since
$u$ is an element of a generalized eigenspace for the pair $(V, W)$ and $a$ is a generalized
eigenspace for all pairs we will have $Va \sim Wa$ and $Vu \sim Wu$. We denote the
line $Va \times Vu$ by $l'$ and see that this line will be the same line as $Wa \times Wu$
(i.e. $l' \sim Va \times Vu \sim Wa \times Wu$). Hence, since the point $x$ is on $l$, the points
$Vx$ and $Wx$ will be on $l'$ since $V$ and $W$ represent collineations. And finally,
by repeating the same argument for a second pair containing $U$ we see that
$Ux$ will also lie on the same line $l'$. That means that the points $Ux$, $Vx$ and $Wx$ are
collinear and hence we will have $\det(\{(q, r, s)x\}_{3\times3}) = \det(x^1 Q + x^2 R + x^3 S) =
\det(Ux, Vx, Wx)^T = 0$ (cf. eq. (10)). Since the point $x$ was arbitrary this result
shows that the first demand of Theorem 1 is satisfied. Now, as is immediately
seen, the right null space of $(x^1 Q + x^2 R + x^3 S) = (Ux, Vx, Wx)^T$ will consist of
$l'$. But $l'$ always goes through $a' \sim Ua \sim Va \sim Wa$. Thus, all these null spaces
span only a two-dimensional subspace.

As for the left null space of the expression above, it is equivalent to the
right null space of the transposed expression, i.e. of $(x^1 Q + x^2 R + x^3 S)^T =
(Ux, Vx, Wx)$. To see what the right null space of the last expression looks like,
let the double generalized eigenvalue of the pair $(U, V)$ be denoted by $\lambda_2$ and
that of the pair $(U, W)$ by $\mu_2$. Then the matrices $(V - \lambda_2 U)$ and $(W - \mu_2 U)$ will
have rank 1. Since $(V - \lambda_2 U)a \sim a'$ we will have $(V - \lambda_2 U)x \sim a' \ \forall x$,
and similarly $(W - \mu_2 U)x \sim a' \ \forall x$. There are therefore always two numbers
$\alpha$ and $\beta$ such that $\alpha(V - \lambda_2 U)x - \beta(W - \mu_2 U)x = 0$ or, equivalently,

$$(Ux, Vx, Wx)\,(\beta\mu_2 - \alpha\lambda_2,\ \alpha,\ -\beta)^T = 0$$

and since $(\beta\mu_2 - \alpha\lambda_2, \alpha, -\beta)^T = (0, \beta, \alpha)^T \times (1, \lambda_2, \mu_2)^T$ we see that this null
space is always orthogonal to $(1, \lambda_2, \mu_2)^T$ and hence lives in a two-dimensional
subspace as well. Thus, all demands required by Theorem 1 were shown to
be satisfied. Q.E.D.
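
For readers who want to check the last step of the proof, the cross product identity and the resulting orthogonality can be written out componentwise. This is only a worked expansion of the expressions already used above, with no additional assumptions:

```latex
\begin{aligned}
(0,\beta,\alpha)^T \times (1,\lambda_2,\mu_2)^T
  &= \bigl(\beta\mu_2 - \alpha\lambda_2,\;\; \alpha\cdot 1 - 0\cdot\mu_2,\;\; 0\cdot\lambda_2 - \beta\cdot 1\bigr)^T \\
  &= (\beta\mu_2 - \alpha\lambda_2,\;\; \alpha,\;\; -\beta)^T , \\[4pt]
(1,\lambda_2,\mu_2)\,(\beta\mu_2 - \alpha\lambda_2,\;\; \alpha,\;\; -\beta)^T
  &= \beta\mu_2 - \alpha\lambda_2 + \alpha\lambda_2 - \beta\mu_2 = 0 .
\end{aligned}
```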

A remarkable fact to note here is that, according to (21), the vector $(1, \lambda_2, \mu_2)^T$
represents nothing else than the epipole $e_{JI}$: $(1, \lambda_2, \mu_2)^T \sim e_{JI}$.

5.2 Algebraic Conditions 

The purpose of this section, and the main scope of the paper, is the
formulation of a set of algebraic constraints that are sufficient for 27 numbers, arranged
as in (10), to constitute a trifocal tensor. We know from the preceding section
that all we have to do is to give the algebraic conditions for three $3 \times 3$ matrices
$U$, $V$ and $W$ to possess the generalized eigenspace constellation shown in Fig.
3. Quite obviously, we demand:

— The polynomial $\det(V - \lambda U)$ should have a single root $\lambda_1$ and a double root
$\lambda_2$, and $(V - \lambda_2 U)$ should have rank 1.

— The polynomial $\det(W - \mu U)$ should have a single root $\mu_1$ and a double
root $\mu_2$, and $(W - \mu_2 U)$ should have rank 1.

— The generalized eigenvectors to the single eigenvalues should be equal.

Since the Jordan canonical decomposition of $U^{-1}V$ or $U^{-1}W$ in the presence of
double eigenvalues, as is the case here, need not be diagonal, the rank 1 demand
in the conditions above is essential. It is easy to see that if these conditions are
satisfied for the two pairs above then they will also be satisfied for the third
pair. Besides, the line representing the two-dimensional generalized eigenspace
of the third pair must go through the point of intersection of the other two pairs.
Therefore, there is no need to demand any further conditions.

The derivation of the condition for a polynomial $p(\lambda) = a\lambda^3 + b\lambda^2 + c\lambda + d$
to possess a double root is elementary. We give here the result. Introducing the
abbreviations

$$ A := b^2 - 3ac, \quad B := bc - 9ad \quad \text{and} \quad C := c^2 - 3bd $$

we have: $p(\lambda)$ will possess a double root if and only if

$$ B^2 - 4AC = 0. \tag{22} $$
If this condition is satisfied then the single root $\lambda_1$ and the double root $\lambda_2$ will
read

$$ \lambda_1 = \frac{B}{A} - \frac{b}{a} \quad \text{and} \quad \lambda_2 = -\frac{B}{2A}. $$

Thus both, the condition for a double root and the roots themselves (provided 
the condition is satisfied) of a third degree polynomial may be given in terms of 
rational expressions in the coefficients of the polynomial. 
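
The double-root test and the rational root expressions can be checked on a small example. The following is only a minimal sketch: the function names are hypothetical, exact rational arithmetic is an illustrative choice, and the case A = 0 (a triple root) is deliberately excluded.

```python
from fractions import Fraction

def double_root_test(a, b, c, d):
    """Return (A, B, C, B^2 - 4AC) for p(x) = a x^3 + b x^2 + c x + d."""
    A = b * b - 3 * a * c
    B = b * c - 9 * a * d
    C = c * c - 3 * b * d
    return A, B, C, B * B - 4 * A * C

def roots_if_double(a, b, c, d):
    """Single and double root as rational expressions in the coefficients.

    Assumes the double-root condition holds and A != 0.
    """
    A, B, _, condition = double_root_test(a, b, c, d)
    if condition != 0 or A == 0:
        raise ValueError("polynomial has no simple/double root pair")
    single = Fraction(B, A) - Fraction(b, a)
    double = -Fraction(B, 2 * A)
    return single, double

# p(x) = 2 (x - 3)^2 (x + 1) = 2x^3 - 10x^2 + 6x + 18
single, double = roots_if_double(2, -10, 6, 18)
print(single, double)
```

For this polynomial the sketch recovers the single root -1 and the double root 3 without any root finding.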

We know that the expression $\det(V - \lambda U)$ may be expanded as follows:
$\det(V - \lambda U) = -\det(U)\lambda^3 + \mathrm{tr}(U'V)\lambda^2 - \mathrm{tr}(V'U)\lambda + \det(V)$, where tr denotes
the trace of a matrix and $A'$ denotes the matrix of cofactors with the property
$A'A = AA' = \det(A) I_3$. Note that $A'$ will exist even if $A$ is singular and that
the rows (columns) of $A'$ are cross products of the columns (rows) of $A$. Hence,
the coefficients of $\det(V - \lambda U)$ are third degree and can be given explicitly as
outlined above, the coefficients A, B and C are sixth degree and the condition
(22) is of degree twelve in the tensor elements.
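
The expansion of the determinant can be verified numerically. The sketch below is an illustration only: the helper names and the two integer test matrices are assumptions, and the cofactor matrix is built from cross products of columns as described in the text.

```python
def det3(M):
    """Determinant of a 3x3 matrix given as nested sequences."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def cofactor_matrix(M):
    """Matrix M' with M'M = MM' = det(M) I; its rows are cross products of the columns of M."""
    cols = list(zip(*M))
    return [cross(cols[1], cols[2]), cross(cols[2], cols[0]), cross(cols[0], cols[1])]

def trace_of_product(X, Y):
    return sum(X[i][j] * Y[j][i] for i in range(3) for j in range(3))

def cubic_coefficients(U, V):
    """Coefficients of det(V - lam*U), ordered from lam^3 down to the constant term."""
    return (-det3(U),
            trace_of_product(cofactor_matrix(U), V),
            -trace_of_product(cofactor_matrix(V), U),
            det3(V))

# Hypothetical integer matrices, used only to exercise the identity.
U = [[2, 0, 1], [1, 3, 0], [0, 1, 4]]
V = [[1, 2, 0], [0, 1, 5], [3, 0, 2]]
c3, c2, c1, c0 = cubic_coefficients(U, V)
print(c3, c2, c1, c0)
```

Each coefficient is a cubic expression in the matrix entries, which is what leads to the degree count stated above.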

Denoting by a the common eigenvector corresponding to the single eigenvalues
and by b the common eigenvector corresponding to the double eigenvalues,
we have: testing for $\mathrm{rank}(V - \lambda_2 U) = 1$ is equivalent to testing
$(V - \lambda_2 U)c \sim Ua$ for any c not on the line a x b, and amounts to two constraints.
Specifically, since the vector a x b as a point will not lie on the line a x b, these
two constraints may be formulated as $(V - \lambda_2 U)(a \times b) \sim Ua$. Similarly,
testing for $\mathrm{rank}(W - \mu_2 U) = 1$ yields another two constraints.

In summary we obtain: 



- Two constraints by demanding eq. (22) for the two polynomials $\det(V - \lambda U)$
and $\det(W - \mu U)$

- Two constraints by demanding equality (up to scale) for the two one-dimensional
generalized eigenspaces (a ~ a)

- Two constraints by demanding $\mathrm{rank}(V - \lambda_2 U) = 1$ and

- Two constraints by demanding $\mathrm{rank}(W - \mu_2 U) = 1$.



Thus, we have presented a minimal set of eight constraints that is sufficient for 
27 numbers to represent a trifocal tensor. It should be stressed that in this paper 
we have treated the general case ignoring some singular cases that may arise. 
Although they are of measure zero in the parameter space some of them might 
well be of practical relevance and worth investigating. 



6 Conclusion 

In this paper we have derived a minimal set of eight constraints that must be
satisfied by 27 numbers (modulo scale) in order for them to constitute a trifocal
tensor. This result is in accordance with the theoretically expected number of
eight independent constraints and is novel, since the sets of sufficient constraints
known to date contain at least 12 conditions. Knowledge of sufficient constraints
for the trifocal tensor is important in constrained estimation from point and/or
line correspondences. It is expected that working with fewer, independent
constraints will result in more stable and faster estimation algorithms.
The key to obtaining this result was shown to be turning attention from the
correlation slices to the homographic slices of the trifocal tensor.

Appendix 

To prove equation PD we start with 0 which after multiplication on the right 
by eiK and rearranging gives: 

{T/*‘e/x}3x3 ^ — ejj{H.Ki^iK)'^ 

Now we use the definitions given in sections 12.11 and El and employ some alge- 
braic manipulations to get: 

{T/^e,^}3x3 ~ PjP+P/C^(P^C,)^ - PjCiiPKPjPiCKf 

- PjP+PiCkCJpI - PjCjC],P+PiPl 
^Pj{PjPl-l4 + l4)CKCjP],-PjCjCl{PjPi-U + U)Pl 

- Pj {PjPi - I4 )Ck CfP^ + PjCkCJpI - 

' X " 

~Ci 

-PjCiC'^{PjPl-l4)Pl 

^CJ 

~ P./C/fCjP^ ^ Q.E.D. 

The proof of (Eill is similar: 

{T/'^e/j}3x3 

^ P,7P + P,C,;(PkC/)^ - PjC7(PkP + P7Cj)^ 

^ PjP + P7C,7CfP^-P,7C7C3:P+P7P^ 

^Pj(P + P7-l4 + l4)CjCfP^-PjC7Cj(P+P/-l4 + l4)P^ 
^ Pj {PjPi - l4)Cj CfP^ + PjCTcCfP^ - 

' V " 

-PjC7Cj(P + P7-l4)P^ 

^ 

r^Cj 

^ PjC/CjP^ ^ ej/e^j Q.E.D. 
Abstract. Stereo correspondence is a central issue in computer vision.
The traditional approach involves extracting image features, establishing
correspondences based on photometric and geometric criteria and, finally,
determining a dense disparity field by interpolation. In this context, occlusions
are considered as undesirable artifacts and often ignored.

The challenging problems addressed in this paper are a) finding an image
representation that facilitates (or even trivializes) the matching procedure
and b) detecting and including occlusion points in such a representation.

We propose a new image representation called Intrinsic Images that can 
be used to solve correspondence problems within a natural and intui- 
tive framework. Intrinsic images combine photometric and geometric de- 
scriptors of a stereo image pair. We extend this framework to deal with 
occlusions and brightness changes between two views. 

We show that this new representation greatly simplifies the computation 
of dense disparity maps and the synthesis of novel views of a given scene, 
obtained directly from this image representation. Results are shown to 
illustrate the performance of the proposed methodology, under perspec- 
tive effects and in the presence of occlusions. 



1 Introduction 

Finding correspondences is one of the central problems in stereo and motion analysis.
To perceive motion and depth, human binocular vision uses shape and edge
segmentation and performs matching between two or more images, captured over
time. In addition, occlusions play a key role in motion and depth interpretation.

The computation of stereo correspondence has traditionally been associated 
with a generic three-step procedure 0, summarized as follows: 

- Select image features, such as edges, interest points or brightness values.

- Find corresponding features based on similarity and consistency criteria,
where similarity considers a distance between features and consistency takes
into account geometric and order constraints. This step is usually algorithmic
and requires a relative weighting between similarity and consistency.

- Compute and interpolate the disparity maps to obtain a dense field.
An additional step consists of detecting occlusion points, where the similarity
and/or consistency criteria are not fulfilled.

Almost all methods for image correspondence follow the procedure described 
above and differ only in the nature of the image features, similarity and consi- 
stency criteria, search algorithms and photometric and geometric assumptions 
(see 0 for an overview). 

We propose a method where the global stereo image information and the
similarity and consistency criteria are well defined in a common, natural and
intuitive framework. This approach can be used to generate a dense disparity map
(explicit reconstruction) or different views of the same scene (without explicit
reconstruction).

1.1 Motivation 

One of the challenges in solving the correspondence problem is that of finding
an image representation that facilitates (or even trivializes) the matching
procedure. As an example, consider two corresponding epipolar lines of the
stereo image pair shown in Figure 1. The simplest functions we can analyse are the
brightness functions $f(y)$ and $g(x)$ defined along each (left or right) epipolar line
(Figure 1a). This figure shows the difficulty of the matching process, since the
gray level functions along both epipolar lines are geometrically deformed by the
3D structure. However, we can obtain other representations for the information
included along a scanline, as follows:

1. A commonly used representation, mainly in optical flow computation |0I,
consists of the spatial derivatives of the brightness (Figure 1b). Yet the
matching process is not trivial, especially for a wide baseline stereo system.

2. To search for correspondences, one could use the plot of Figure 1a and
determine the points with equal brightness value. However, this would only
work when the brightness is kept exactly constant and would lead to many
ambiguities, as illustrated in the figure for the gray value 240.

3. Another possible representation consists in plotting the brightness versus its
derivative [5], as shown in Figure 1c. In this case, the image points with the
same brightness and derivatives have approximately the same coordinates,
indicating a correspondence. Again, there are some ambiguous situations
(shown in the figure) and the points are coincident only if the disparity is
constant (no perspective effects) along these image lines.

These representations can be generalized by considering other local descriptors
(other than brightness, position and spatial gradient) computed along two or
more corresponding scanlines. Tomasi and Manduchi [5] have proposed to
represent a set of local descriptor vectors (brightness, derivative, second derivative,
etc.) through a curve in an n-dimensional space, where the curve represented in
Figure 1c is a simple example. Ideally, two curves computed along two
corresponding scanlines can be mapped (or even coincide).

However, approaches based on curves of local descriptor vectors have obvious
limitations related to rigid geometric distortion assumptions, solution ambiguity
and/or high-dimensionality search algorithms. First of all, the method is only
valid for constant and affine disparities (no perspective effects have been
considered). Secondly, the curves have a difficult representation, especially if more
than two local descriptors are considered. Finally, a curve can cross itself, clearly
generating ambiguous situations.

Fig. 1. Photometric and geometric relations of the brightness values along a
scanline captured by a stereo pair. a) Brightness values versus pixel positions;
b) Derivatives of the brightness versus pixel positions; c) Derivatives versus
brightness values.

In this paper, we develop a simple framework that overcomes the restrictive
geometric, photometric and algorithmic constraints mentioned before. We
propose to study other kinds of representations, based not only on local descriptors
but also on global descriptors of the image, that we call Intrinsic Images. They
simplify the (dense) matching process and can be used to generate new views
from a stereo pair.



1.2 The Proposed Method 

Our main goal consists of finding a representation with the following fundamental 
features: 

1. The new representation can be displayed as a simple image.

2. It must encode both the photometric and geometric structure of the original 
image. 

3. The original images can be recovered again from this representation. 

4. The disparity can be computed easily. 

5. It can handle occlusion points. 




Intrinsic Images for Dense Stereo Matching with Occlusions

To achieve that, we propose to use both local and global descriptors of the image
in a new representation, which we call an Intrinsic Image. By using this
representation, the computation of correspondences and disparity fields can be done in a
straightforward manner, and all the requirements considered above are fulfilled.

However, some assumptions have to be made. Thus, our initial effort con- 
sists of defining three general and acceptable assumptions in order to produce a 
correspondence framework which can be considered sufficiently reliable. 

- Calibration: The stereo system is weakly calibrated, which means that
corresponding epipolar lines are known.

- Photometric Distortion: Given two corresponding points $x_0$ and $y_0$, the
respective photometric values are related by a generic non-linear function
$f(y_0) = \Psi(g(x_0))$, where $f(y_0)$ and $g(x_0)$ represent a given photometric
measure (e.g. the brightness value). We study the cases where the photometric
distortion function, $\Psi(g)$, is the identity (constant brightness) or affine
(with contrast and mean brightness differences).

- Geometric (perspective) Distortion: Two corresponding profiles are
related by a disparity mapping function defined by $x = \Phi(y)$, between the
two images. We assume that $\Phi(y)$ is a strictly increasing and
differentiable function.

In summary, we propose to study the intensity-based matching problem of 
two corresponding epipolar lines, where the matched brightness points are ruled 
by a generic model for both geometric (perspective) and photometric distortion. 

Additionally, there is a set of issues that we have to remark on. First of all, the
disparity mapping function $\Phi(y)$ includes important perspective and structure
distortions of the scene. However, the related assumptions made before are only
piecewise valid, considering that there are discontinuities and occlusion points
in the images. Secondly, occlusion points have to be included coherently in our
framework, by observing their position relative to the corresponding points.
However, unless prior knowledge exists, no depth information of an occlusion
point can be recovered from two images. Finally, imposing $\Phi(y)$ to be strictly
increasing represents an order criterion, excluding order exchanges between two
corresponding points. In fact, an order exchange implies that an occlusion will
occur between the two views. Thus, we consider those points as occlusions.



1.3 Structure of the Paper 

In Section 2, we consider the case of brightness constancy between the two
views, without including occlusions, and define the Intrinsic Images and their
most important properties. In Section 3, we introduce occlusions, performing the
necessary transformations to the proposed method. In Section 4, we generalize the
method to the brightness distortion case. In the last two sections, we report
some results on disparity estimation and generation of novel views, and discuss
the general framework.
2 Intrinsic Images

The simplest kind of matching is that of two corresponding epipolar lines derived
from two views of the same scene without occlusions. To simplify the general
framework, we assume that there are no intensity changes due to viewing direction
and that the disparity mapping function, $\Phi(y)$, defined as:

$$ x = \Phi(y), \qquad \frac{d\Phi(y)}{dy} > 0 \tag{1} $$

satisfies the order constraint and represents the unknown deformation at $y$ to
produce the corresponding point $x$. Assuming the brightness constancy hypothesis,
the following nonlinear model can be expressed as

$$ f(y) = g(x) = g(\Phi(y)) \tag{2} $$

In order to develop a simple matching framework, we propose alternative 
representations, based not only on local descriptors but also on global image 
descriptors. 

The simplest example of a global descriptor of a scanline is the integral of
brightness. One could associate each scanline pixel, $x$, to the sum of all brightness
values between $x = 0$ and $x$. However, this integral would be different for
two corresponding points in the two images, due to geometric (perspective)
distortion.

In the remainder of this section, we will derive a new representation, based
on a different global image descriptor. Using two horizontal scanlines of the
perspective projection of the synthetic scene illustrated in Figure 1, we show
that it can deal with perspective distortion between the two images.

Let $f$ and $g$ be the intensity values along corresponding epipolar lines. We now
assume that both functions are differentiable, obtaining the following expression
from equation (2):

$$ \frac{df(y)}{dy} = \frac{d\Phi(y)}{dy}\,\frac{dg(x)}{dx} \tag{3} $$

In the absence of occlusions, we can further assume that all brightness information
is preserved between two corresponding segments $]y_1, y_2[$ and $]x_1, x_2[$ contained
respectively in left and right images. Then, by assuming that $d\Phi(y)/dy > 0$
(order constraint), one proves analytically that the following equality holds:

$$ \int_{y_1}^{y_2} \left|\frac{df(y)}{dy}\right| dy = \int_{y_1}^{y_2} \frac{d\Phi(y)}{dy} \left|\frac{dg(x)}{dx}\right|_{x=\Phi(y)} dy = \int_{x_1}^{x_2} \left|\frac{dg(x)}{dx}\right| dx \tag{4} $$



Now, let us define two functions $\alpha$ and $\beta$:

$$ \alpha(y_i) = \int_{y_1}^{y_i} \left|\frac{df(y)}{dy}\right| dy \qquad \beta(x_i) = \int_{x_1}^{x_i} \left|\frac{dg(x)}{dx}\right| dx \tag{5} $$

where $x_i > x_1$, $y_i > y_1$ and $x_1$, $y_1$ are corresponding points. Equation (4) shows
that when $y_i$ and $x_i$ are corresponding points, then $\alpha(y_i) = \beta(x_i)$. This means
that, associating each scanline pixel, $x$, to the sum of the absolute value of the
derivatives between $x = 0$ and $x$, we obtain the same values for corresponding
points, independently of arbitrary image distortion, such as perspective effects.
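
The invariance of the accumulated absolute derivative can be illustrated numerically. The sketch below rests on assumptions that are not in the paper: a synthetic right-scanline brightness `g`, a strictly increasing mapping `phi` on [0, 1], and a running sum of absolute differences as a discrete stand-in for the integrals.

```python
import math

def cumulative_variation(samples):
    """Running sum of absolute differences (discrete accumulated |derivative|)."""
    total, out = 0.0, [0.0]
    for prev, cur in zip(samples, samples[1:]):
        total += abs(cur - prev)
        out.append(total)
    return out

def g(x):
    # Hypothetical brightness profile of the right scanline.
    return 100.0 + 50.0 * math.sin(3.0 * x) + 20.0 * math.cos(7.0 * x)

def phi(y):
    # Hypothetical disparity mapping: strictly increasing, phi(0) = 0, phi(1) = 1.
    return 0.5 * y * y + 0.5 * y

def sample_at(values, x):
    """Linear interpolation of values given on a uniform grid over [0, 1]."""
    pos = x * (len(values) - 1)
    lo = min(int(pos), len(values) - 2)
    frac = pos - lo
    return (1.0 - frac) * values[lo] + frac * values[lo + 1]

N = 2001
grid = [i / (N - 1) for i in range(N)]
alpha = cumulative_variation([g(phi(y)) for y in grid])  # left scanline, f(y) = g(phi(y))
beta = cumulative_variation([g(x) for x in grid])        # right scanline

# Compare alpha at y with beta at the corresponding point x = phi(y).
max_dev = max(abs(a - sample_at(beta, phi(y))) for a, y in zip(alpha, grid))
print(max_dev)
```

The remaining deviation comes only from sampling; it is small compared with the total accumulated variation.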

In the following sections we will use these functions to build photometric
and geometric image descriptors of a stereo pair. Such a combined representation,
called an Intrinsic Image, will later be used for disparity estimation and synthesis
of new views.



2.1 Photometric Descriptors 

Using the stereo pair represented in Figure 1 as an example, we can compute the
functions $m_l = \alpha(y)$ and $m_r = \beta(x)$. We have shown in the previous section that
$m_l = m_r$ for two corresponding points. Going a step further, Figure 2 shows
the values of these functions, $m_l = \alpha(y)$, $m_r = \beta(x)$, versus the image intensity
values, computed for the stereo images considered in the example. Not only
are the points coincident, but they can also be represented through an image
(putting together the various epipolar lines). One then observes a set of useful
properties:

a) $\alpha$ and $\beta$ are both monotonic increasing functions.

b) If $x_i$, $y_i$ are corresponding points, then $\alpha(y_i) = \beta(x_i)$, according to
equations (4) and (5).

c) If $\alpha(y_i) = \beta(x_i)$, then $f(y_i) = g(x_i)$.

d) Let $m_l = \alpha(y)$. Every value of $m_l > 0$ corresponds to one and only one
brightness value $f(y)$, meaning that the function $f(m_l)$ can be defined and
represents a photometric descriptor. The same is applicable to $g(x)$.

Fig. 2. Values from functions $m_l = \alpha(y)$, $m_r = \beta(x)$ versus the image intensities.
These photometric descriptors, built from the stereo images, code the photometric
information available in the two images of the stereo pair, irrespective of
perspective distortions. Hence, in the absence of occlusions and under brightness
constancy, they are equal for the two images. Later, in Section 3, we will show
how to use this property for occlusion detection.

When building a photometric descriptor for the image pair, we have lost 
information about the spatial domain that could lead to the computation of 
disparity. This aspect will be addressed in the following subsection. 

2.2 Geometric Descriptors 

We have seen a representation that codes all the photometric information of the
stereo image pair, $f(m)$ and $g(m)$, and we now need to retrieve the geometrical
information that is related to disparity.

Let us define the generalized functions $y'(m) = dy/dm$ and $x'(m) = dx/dm$.
These functions, $x'(m)$ and $y'(m)$, are computed from images and take into
account the local geometric evolution of the brightness along the scanlines.

Hence, we can form an image $x'(m)$ and $y'(m)$ by putting together various
epipolar lines of an image. These descriptors convey all the necessary geometric
(disparity) information available in the image pair. Each disparity, $d(x_i)$, and
pixel value, $x_i$, can be recovered for each value of $m_i$, as follows:

$$ (x_i, d(x_i)) = \left( \int_0^{m_i} x'(m)\,dm,\ \int_0^{m_i} \bigl(y'(m) - x'(m)\bigr)\,dm \right) \tag{6} $$

We can generalize the definitions above for images with brightness
discontinuities. In order to guarantee the validity of the same theory, we define $df(y)/dy$ as
a generalized function, admitting Dirac deltas at some points (at the brightness
discontinuities). Thus, $\alpha$ is also discontinuous, $f(m)$ is uniquely defined on a
restricted domain, and $x'(m)$ is zero in non-imaging areas (that is, values of
$m$ not represented by a brightness value).

The geometric descriptors, together with the photometric descriptors, form
complete Intrinsic Images that contain both photometric and geometric
information, represented in the same coordinate system, of the stereo pair.

2.3 Definition of Intrinsic Images

In this section we will define formally the Intrinsic Images obtained from the
photometric and geometric descriptors presented in the previous sections, and
introduce some of the interesting applications of these representations.

Let $k$ index corresponding epipolar lines of a stereo image pair. Then, the
Intrinsic Images $\mathcal{X}(m,k)$ and $\mathcal{Y}(m,k)$ are defined as:

$$ \mathcal{Y}(m,k) = (f(m,k),\ y'(m,k)) \qquad \mathcal{X}(m,k) = (g(m,k),\ x'(m,k)) \tag{7} $$

where $m$ is computed by equation (5) and $(m, k)$ are the coordinates of the
intrinsic images. It is possible to reconstruct completely the original left and right
images based on $\mathcal{Y}(m,k)$ and $\mathcal{X}(m,k)$, respectively. Figure 3 shows the photometric
and geometric components of the Intrinsic Images, computed for the stereo
image pair shown in Figure 1.
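
As an illustration of how disparity follows from the descriptors without a search over brightness, the sketch below matches points by equal values of m on a synthetic scanline pair. It is a discrete approximation under assumed data (the profile `g`, the mapping `phi` and all helper names are hypothetical): instead of integrating x'(m) and y'(m) explicitly, it inverts the monotonic function m_r to find the right position that shares the m value of each left pixel, which is equivalent in the continuous case.

```python
import bisect
import math

def cumulative_variation(samples):
    """Running sum of absolute differences along a scanline."""
    total, out = 0.0, [0.0]
    for prev, cur in zip(samples, samples[1:]):
        total += abs(cur - prev)
        out.append(total)
    return out

def invert_monotone(values, target):
    """Fractional index at which a non-decreasing sequence reaches target."""
    hi = bisect.bisect_left(values, target)
    if hi <= 0:
        return 0.0
    if hi >= len(values):
        return float(len(values) - 1)
    lo = hi - 1
    return lo + (target - values[lo]) / (values[hi] - values[lo])

def g(x):
    # Hypothetical right-scanline brightness.
    return 100.0 + 50.0 * math.sin(3.0 * x) + 20.0 * math.cos(7.0 * x)

def phi(y):
    # Hypothetical strictly increasing disparity mapping on [0, 1].
    return 0.5 * y * y + 0.5 * y

N = 2001
grid = [i / (N - 1) for i in range(N)]
m_l = cumulative_variation([g(phi(y)) for y in grid])  # left: f(y) = g(phi(y))
m_r = cumulative_variation([g(x) for x in grid])       # right

# Disparity y - x in pixels, obtained by matching equal m values.
disparity = [i - invert_monotone(m_r, m) for i, m in enumerate(m_l)]
true_disparity = [(y - phi(y)) * (N - 1) for y in grid]
mean_error = sum(abs(d - t) for d, t in zip(disparity, true_disparity)) / N
print(mean_error)
```

The inversion is least accurate near brightness extrema, where m changes slowly; this corresponds to the regions where x'(m) becomes large.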




Fig. 3. Intrinsic images. Top: photometric descriptors. Bottom: geometric de- 
scriptors. 

Given the properties described before, we state the following observation for 
Intrinsic Images, under the brightness constancy assumption and in the absence 
of occlusions: 

Observation 1 — Intrinsic Images Property 

If $f(y,k) = g(\Phi(y),k)$ and $d\Phi(y)/dy > 0$ for all $(y, k)$, then $f(m,k) = g(m,k)$
and the relation between $x'(m,k)$ and $y'(m,k)$ gives the geometric deformation
between corresponding points.

From this observation we derive an interesting property, that will allow us to 
generate novel views of the observed scene, including all perspective effects. 

Observation 2 — Synthesis of new Images 

Assume that both cameras are parallel, related by a pure horizontal translation
$T$, with intrinsic images $\mathcal{Y} = (f(m,k), y'(m,k))$ and $\mathcal{X} = (f(m,k), x'(m,k))$.
Then, views at intermediate positions $jT$ (where $0 < j < 1$) have the following
intrinsic images, $\mathcal{X}_j$:

$$ \mathcal{X}_j(m,k) = (f(m,k),\ j\,x'(m,k) + (1 - j)\,y'(m,k)) \tag{8} $$

Proof. Suppose that the disparity between two corresponding points $x_j$ and $y$
of two generic parallel cameras is given by the well known expression
$x_j = y + jT/Z$, where $Z$ denotes the depth of the object relative to the cameras
and $jT$ represents the translation between them. Differentiating both sides of the
equality with respect to $m$, we obtain

$$ x_j'(m) = y'(m) + jT\,\frac{dZ^{-1}}{dm} \tag{9} $$

Finally, by performing a weighted average between the left and the right cameras
($j = 0$ and $j = 1$ respectively), we obtain the same expression for $x_j'(m)$:

$$ x_j'(m) = j\,x'(m) + (1 - j)\,y'(m) = y'(m) + jT\,\frac{dZ^{-1}}{dm} \tag{10} $$

□

This result provides a means of generating intermediate unobserved views 
simply by averaging the geometric component of the original Intrinsic Images. 
It accounts for perspective effects without an explicit computation of disparity. 
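
A minimal sketch of the averaging step in equation (8) follows. The function names, the sample values and the rectangle-rule integration over m are assumptions made for illustration; rendering the synthesized scanline from the resulting positions is left out.

```python
def intermediate_geometry(y_prime, x_prime, j):
    """Geometric descriptor of the view at position j*T, following eq. (8)."""
    return [j * xp + (1.0 - j) * yp for yp, xp in zip(y_prime, x_prime)]

def integrate(derivative, dm):
    """Running integral over m (rectangle rule), giving pixel positions."""
    position, out = 0.0, []
    for value in derivative:
        position += value * dm
        out.append(position)
    return out

# Hypothetical geometric descriptors sampled on a common m grid.
y_prime = [1.0, 1.0, 1.0, 1.0]
x_prime = [0.5, 1.5, 1.0, 2.0]

mid_prime = intermediate_geometry(y_prime, x_prime, 0.5)
mid_positions = integrate(mid_prime, 1.0)
print(mid_prime, mid_positions)
```

Setting j = 0 or j = 1 returns the left or right descriptor unchanged, so the original views are the end points of the family of synthesized views.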

We have described the essential framework for our matching approach as- 
suming brightness constancy and in the absence of occlusions. We have defined 
Intrinsic Images and shown how to use these images to compute disparity in a 
direct way. Next, we will introduce occlusion information and, finally, relax the 
brightness constancy constraint. 

3 Dealing with Occlusions 

In computer vision, occlusions are often considered as artifacts, undesirable in 
many applications, ignored in others. However, occlusions are neither rare nor 
meaningless in stereo interpretation. Anderson and Nakayama 0 have shown 
that an occlusion plays an important role for the human perception of depth 
and motion, and (Zj describes a theoretical connection between occlusion classi- 
fication and egomotion detection in computer vision. 

Many algorithms have been designed to handle occlusions for multiple image 
motion or disparity estimation ISE0. Here, we focus on introducing occlusions 
in Intrinsic Images, ensuring that there is no exceptional treatment based on 
cost functions or ad hoc thresholds.

An occlusion occurs when a surface is in front of an occluded region, which
can be seen by only one camera. However, unless we impose prior models on
image disparities, 3D structure or global image features, we can only detect the
existence of a local occlusion based on photometric information. We will show
how to include occlusion information in Intrinsic Images, based on the theory
developed in the last section.
First of all, even in the presence of occlusion points, an intrinsic image can
be defined as before. However, the Intrinsic Images Property stated in
Observation 1 does not hold, because its sufficient condition is not satisfied. In fact,
the condition $f(y,k) = g(\Phi(y),k)$ is only piecewise valid (along corresponding
profiles), but it is not valid in general, namely at occlusion points.

It is worth noticing important differences between the usual cartesian images 
and the associated photometric descriptor of the intrinsic images, in a stereo 
pair. While the intrinsic images will only differ in the presence of occlusions, 
cartesian images differ both by perspective effects and occlusions. This means 
that, in order to detect occlusions or, equivalently, photometric dissimilarities, 
we can rely on the photometric descriptors of the intrinsic images, where the 
geometric distortions have been removed. 

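Therefore, we propose to define global image descriptors similar to those
discussed previously. Consider

$$ m_l = \int_{y_1}^{y} \left|\frac{df(u)}{du}\right| du \qquad m_r = \int_{x_1}^{x} \left|\frac{dg(v)}{dv}\right| dv \tag{11} $$

computed on corresponding epipolar lines of the left and right cameras, where
$x_1$ and $y_1$ are the respective initial (and not necessarily corresponding) points.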

In the previous section, we have shown that in the absence of occlusions,
$m_l = m_r$ for corresponding points, greatly simplifying the matching procedure.
In the presence of occlusions, the matching is not that trivial, but $m_l$ and $m_r$
can still be related by a simple function. Let $m_l$ and $m_r$ be parameterized by $t$
as follows:

$$ m_l = r(t) \qquad m_r = s(t) \tag{12} $$

The curve produced by these functions can take only three forms:

1. Horizontal inclination ($dr(t)/dt = 1$; $ds(t)/dt = 0$): there exists a region in
the left camera that is occluded in the right camera.

2. Vertical inclination ($dr(t)/dt = 0$; $ds(t)/dt = 1$): there exists a region in the
right camera that is occluded in the left camera.

3. Unitary inclination ($dr(t)/dt = 1$; $ds(t)/dt = 1$): both profiles match (no
occlusions).

Figure 4 shows examples of two $(m_l, m_r)$ matching scenarios with and without
occlusions.

Fig. 4. Matching scenarios without (left) and with occlusions (right).

Hence, the problem to solve is that of determining the mapping curve from
$m_l$ to $m_r$, which can be done through several approaches. Fortunately, there is
a low cost algorithm that solves it optimally, in the discrete domain.

Assume that we have two sequences $F = \{f(m_{l1}), f(m_{l2}), \ldots, f(m_{lp})\}$ and
$G = \{g(m_{r1}), g(m_{r2}), \ldots, g(m_{rq})\}$ given by the photometric information along
corresponding epipolar lines of the left and right intrinsic images. The
corresponding points constitute a common subsequence of both F and G. Moreover,
finding the set of all corresponding points corresponds to finding the
maximum-length common subsequence of F and G. This problem is the well known
longest-common-subsequence (LCS) problem, which can be solved efficiently
using dynamic programming.

Notice that an LCS solution for our problem implies two things: (1) the
corresponding points obey the order constraint; (2) an order exchange represents
the existence of occlusions. These two implications can produce some ambiguous
situations. However, such cases correspond mostly to perceptual ambiguities that
can only be solved with prior knowledge about the scene.
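
The LCS formulation can be sketched with the textbook dynamic-programming recurrence. This is a generic illustration, not the authors' implementation: it assumes quantized brightness values compared for exact equality, and the function name and example sequences are hypothetical. Unmatched entries of either sequence play the role of occlusions.

```python
def lcs_matches(F, G):
    """Index pairs (i, j) of one longest common subsequence of F and G."""
    p, q = len(F), len(G)
    # L[i][j] = length of the LCS of F[:i] and G[:j]
    L = [[0] * (q + 1) for _ in range(p + 1)]
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            if F[i - 1] == G[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    # Backtrack from the end to recover the matched pairs in increasing order.
    matches, i, j = [], p, q
    while i > 0 and j > 0:
        if F[i - 1] == G[j - 1]:
            matches.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    matches.reverse()
    return matches

# Hypothetical photometric sequences along one pair of epipolar lines.
F = [10, 20, 30, 40, 50]
G = [10, 20, 99, 40, 50]
matches = lcs_matches(F, G)
left_only = [i for i in range(len(F)) if i not in {m[0] for m in matches}]
right_only = [j for j in range(len(G)) if j not in {m[1] for m in matches}]
print(matches, left_only, right_only)
```

Because matched index pairs increase in both sequences, the order constraint is satisfied by construction. The table costs time and memory proportional to the product of the two sequence lengths.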

After finding an LCS solution or, equivalently, the curve that matches $m_l$
and $m_r$, we can change the definition of the Intrinsic Images in order to maintain
the theory described in the last section.

Observation 3 — Intrinsic Images Property with occlusions 

Given two stereo images, if $f(y,k) = g(\Phi(y),k)$ and $d\Phi(y)/dy > 0$ for almost
all $(y, k)$ (except for occlusion points), then it is possible to determine a pair
of intrinsic images $\mathcal{Y}(t,k)$ and $\mathcal{X}(t,k)$,

$$ \mathcal{Y}(t,k) = (f(t,k),\ y'(t,k)) \qquad \mathcal{X}(t,k) = (g(t,k),\ x'(t,k)) \tag{13} $$

where $f(t,k) = g(t,k)$ and the relation between $x'(t,k)$ and $y'(t,k)$ gives the
geometric deformation between corresponding and occlusion points of the stereo
images. Knowing the functions $r(t)$ and $s(t)$ of equation (12), these new
intrinsic images are found based on the original intrinsic images (as presented in
Section 2), by performing the following transformation:

$$
\begin{aligned}
(f(t,k),\ y'(t,k)) &=
\begin{cases}
(f(r(t),k),\ y'(r(t),k)) & \text{if } dr(t)/dt = 1 \\
(g(s(t),k),\ 0) & \text{if } dr(t)/dt = 0
\end{cases} \\
(g(t,k),\ x'(t,k)) &=
\begin{cases}
(g(s(t),k),\ x'(s(t),k)) & \text{if } ds(t)/dt = 1 \\
(f(r(t),k),\ 0) & \text{if } ds(t)/dt = 0
\end{cases}
\end{aligned}
\tag{14}
$$



This observation has two important implications. First, by computing the
functions $r$ and $s$ as a solution of the LCS problem, we can derive a coherent
framework which makes it possible to compute disparities or generate new views as
defined in the last section. Secondly, the generated intermediate views exhibit consistent
information in occlusion areas. However, it does not imply that this information
is consistent with the real 3D structure of the occluded points (given the
impossibility of recovering that structure).



4 Photometric Distortion 



So far we have considered the hypothesis of brightness constancy. In a real stereo 
system, however, the brightness can change due to viewing direction or different 
specifications of the cameras. A convenient model to account for photometric 
distortion is the following: 

$$ f(y) = a \cdot g(\Phi(y)) + b, \qquad a > 0 \tag{15} $$



This model represents an affine transformation, where $a$ and $b$ are the
difference in contrast and brightness between the two images. Some authors prefer
to estimate a priori the affine parameters P], before applying the correspondence
procedure. This can be performed by analyzing the global intensity function and
its derivatives. However, it would be constructive to study the geometric
influence of the affine distortion on the intrinsic image structure.

Considering the brightness distortion in Equation (15), the effect of the bias, $b$,
can be eliminated by preprocessing both signals with a zero-mean filter. Thus,
we assume that the contrast difference term $a$ is the dominant term.

By assuming that $f(y) = a \cdot g(\Phi(y))$ and applying equations (5), a simple
relation between the horizontal axes of the intrinsic images is found: $m_l = a \cdot m_r$.
This means that the photometric information along the scanlines of the left
intrinsic image is scaled along the horizontal axis and in amplitude (the intensity
values), by $a$, with respect to the right intrinsic image. Thus, the geometric
deformation induced by the brightness distortion in the intrinsic images is ruled
by the following equality:

$$ \frac{f(m_l)}{m_l} = \frac{g(m_r)}{m_r} \tag{16} $$

We could apply directly to the intrinsic images a correspondence procedure
by using equation (16). Nevertheless, this implies some search effort.

Another solution to overcome this problem consist of transforming the brightn- 
ess function by a simple function which exhibits some invariant properties related 
to linear distortions. The logarithm function is a good candidate. In fact, when 
y and x are corresponding points, we have: 



dlog\f{y)\ 

dy 



d^{y) dlog \g{x)\ 
dy dx 



(17) 



defined wherever f(y) and g(x) are nonzero. Applying this relation to
equation m, one can conclude that it is still possible to define coherent
photometric and geometric descriptors, since m_l = m_r (in the absence of
occlusions). This means that, in general, the intrinsic image theory remains
applicable.
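The invariance of the logarithmic derivative to the contrast term can be checked numerically. The following is a minimal sketch, not the authors' implementation: the scanline `g`, the contrast value `a` and the helper `log_derivative` are hypothetical, the bias is assumed to be already removed, and the warp is assumed to be the identity so that dΦ/dy = 1.

```python
import numpy as np

def log_derivative(signal, step):
    """Finite-difference derivative of log|signal| on a uniform grid."""
    return np.gradient(np.log(np.abs(signal)), step)

# Hypothetical right-image scanline g(x), strictly positive so the log is defined.
x = np.linspace(0.0, 1.0, 501)
step = x[1] - x[0]
g = 2.0 + np.sin(2.0 * np.pi * x)

# Left scanline with a contrast change only (a > 0, no bias, identity warp).
a = 3.5
f = a * g

# log|a g| = log a + log|g|, so the constant log a vanishes under differentiation.
d_log_g = log_derivative(g, step)
d_log_f = log_derivative(f, step)
max_gap = float(np.max(np.abs(d_log_f - d_log_g)))
```

Under these assumptions the two derivative profiles agree up to floating-point rounding, which is why no search over the contrast parameter is needed after the logarithmic transform.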
5 Results

Throughout the paper, we have used a simple synthetic stereo pair in order to
illustrate the various steps of our approach. By using the intrinsic images
shown in Figure 0 we have used Equation m to compute a dense disparity map,
shown in Figure 5, without requiring any search algorithms. We applied the
same methodology to another stereo pair, from the baseball sequence, and
computed the respective disparity map, also shown in Figure 5.




Fig. 5. Disparity map determined directly from the Intrinsic Images. On the 
left: Disparity map of the synthetic pair; On the center: Left image of a stereo 
pair from the baseball sequence; On the right: the associated disparity map. 



In Figure 6, we show results obtained with the synthetic images and with the
“flower garden” and “trees” sequences. Except in the synthetic case, a
significant amount of occlusion is present. We have applied the proposed
method, based on the Intrinsic Images, to synthesize new views from the
original left and right stereo images. The occluded regions move coherently in
the sequences. The perceptual quality of the reconstructed images is
comparable with that of the original ones. Notice that in the “trees” sequence
there is an area, roughly in the center of the image, where some occlusions
associated with order exchanges create erroneous solutions for the synthesis
of that area. This can only be solved by introducing more images of the
sequence or with prior knowledge about the scene. To illustrate the ability of
the method to cope with occlusions, Figure 6 b) shows the recovered views on a
detail of the images (see the little square in the respective sequence), with
a strong discontinuity in depth. As expected, the occluded points disappear
coherently behind the tree, without any smoothing.

These examples illustrate how Intrinsic Images deal with perspective effects 
and occlusions, generating dense disparity fields and novel views from an original 
stereo pair. 

Fig. 6. a) Synthesis of intermediate views for the synthetic images (left), the
Flower Garden Sequence (center) and for the Tree Sequence (right), b) The
evolution of a detail of the Tree Sequence (see little square in the respective
sequence) with occlusions.

6 Conclusions

We have proposed a new image representation, the Intrinsic Image, which
allows for solving the intensity-based matching problem under a realistic set
of assumptions. The concept of intrinsic images is a useful way to approach
the problem of stereo vision, and leads to a straightforward computation of
disparity and new views.

Intrinsic Images of a stereo pair give exactly the complete photometric and 
geometric structure of the 3D scene, independently of the observed geometric 
(perspective) distortion. Secondly, the occlusions are introduced naturally in 
this framework, which means that we apply a consistent interpretation to the 
occluded points. 

An Intrinsic Image is composed of a photometric and a geometric descriptor,
which contain all the necessary information to reconstruct directly the
original images, disparity values and other views of the same scene. We have
presented some results with real stereo images. The method is very robust to
the existence of occlusions and reveals high performance in multi-view
synthesis.

This approach is a powerful tool to cope not only with a stereo pair of images
but also with a sequence of images, both in the discrete and continuous sense.
In the future, we plan to study the potential of this approach applied to a
larger number of images, namely in dense optical flow computation and
egomotion estimation. We also plan to develop an automatic occlusion
characterization based on a large number of images of the same scene, in order
to facilitate the creation of the associated intrinsic images with occlusion
information.
Abstract. This paper addresses the problem of the local scale parameter
selection for recognition techniques based on Gaussian derivatives. Patterns
are described in a feature space of which each dimension is a scale and
orientation normalized receptive field (a unit composed of normalized
Gaussian-based filters).

Scale invariance is obtained by automatic selection of an appropriate local
scale [Lin98b], followed by normalisation of the receptive field to the
appropriate scale. Orientation invariance is obtained by the determination of
the dominant local orientation and by steering the receptive fields to this
orientation.

Data is represented structurally in a feature space that is designed for the
recognition of static object configurations. In this space an image is modeled
by the vectorial representation of the receptive field responses at each
pixel, forming a surface in the feature space. Recognition is achieved by
measuring the distance between the vector of normalized receptive field
responses of an observed neighborhood and the surface point of the image
model.

The power of a scale equivariant feature space is validated by experimen- 
tal results for point correspondences in images of different scales and the 
recognition of objects under different view points. 



1 Introduction 

Object indexing is the problem of determining the identity of a physical object 
from an arbitrary viewpoint under arbitrary lighting conditions. Changes in the 
appearance of the object under variations in lighting and viewpoint make this a 
difficult problem in computer vision. The classic approach is based on the idea 
that the underlying 3D structure is invariant to viewpoint and lighting. Thus 
recovery of the 3D structure should permit the use of techniques such as back- 
projection and features matching |Fa,ii93j to match the observed structure to a 
data base of objects. 

An alternative to 3D reconstruction is to remain in the 2-D image space and
to consider measurements of the object appearance. Sirovitch and Kirby pKiq
showed that in the case of face recognition, Principal Components Analysis
(PCA) can be used to generate a low dimensional orthogonal subspace. The
distance between two points in this space is determined by an inner product,
whose computation cost depends on the dimensionality of the space. Turk and
Pentland refined and popularized this approach, greatly enhancing the
acceptance of principal components analysis as a vision technique. Murase and
Nayar [MN95] extended this idea by expressing the set of appearances of
objects as a trajectory in a PCA space. Black and Jepson iFTmni demonstrated
that the appearance of a hand making a gesture could also be expressed and
matched as a trajectory in a PCA space.

All of the above techniques are sensitive to partial occlusions and scale nor- 
malization. The position of a projected image in the PCA space is coupled with 
the appearance of the image. Object translations within the image, variable 
background, differences in the image intensity or illumination color alter the 
position of the image in PCA space. Thus, PCA techniques require object detec- 
tion, segmentation and precise normalisation in intensity, size and position. Such 
segmentation and normalization is very difficult, and there exists no approach 
that solves this problem in the general case. 

Segmentation and normalization problems can be avoided by using local ap- 
pearance based methods ISchhVICCh8al[?T^ that describe the appearance of 
neighborhoods by receptive fields. The effects of background and partial occlu- 
sion are minimized by considering small neighborhoods. The problem of object 
position within the image is solved by mapping the locally connected structures 
into surfaces or multi-dimensional histograms in a space of local appearances. 
Robustness to changes in illumination intensity is obtained by energy
normalization during the projection from image to the appearance space. The resulting
technique produces object hypotheses from a large data base of objects when 
presented with a very small number of local neighborhoods from a newly acqui- 
red image. 

In this paper the local appearance technique proposed by Colin de Verdiere
[CC98a] is extended to a local description technique which is scale and
orientation invariant. A description of the local visual information is
obtained using a set of Gaussian derivatives. The Gaussian derivative
responses to different images result in a family of surfaces. Such a surface
is another representation of the model image. Recognition is achieved by
projecting neighborhoods from newly acquired images into the local appearance
space and associating them to nearby surfaces. This technique leads to the
problem of parameterizing the Gaussian derivatives in scale and orientation.

Lindeberg defines a set of Gaussian derivative operators to select scale for
edge, ridge, corner, and blob features, thus for feature detection [Lin98a].
We adopt this approach and apply it to all image points. Local appropriate
scales are detected at every point, allowing the parameterization and
normalization of the Gaussian derivatives. This leads to a scale equivariant
description. Also, detecting the dominant orientation of neighborhoods allows
the receptive fields to be normalized by orientation [FA91], which yields an
orientation equivariant description. The scale parameter is also important for
orientation normalization, because a scale that is not well adapted to the
local structure makes orientation detection unstable.
In this article we investigate local scale selection for appearance based
recognition techniques. As demonstrated below, the appropriate local scale is
an important factor. We focus on local scale selection according to the
feature type. Experiments show that local scale selection with consideration
of the feature type improves object recognition.

In the next section the pattern description and representation is explained. 
The proposed approach can be applied to patterns in spatial or frequency do- 
main. Then the scale and orientation equivariance property is described accor- 
ding to the main publications in this area. As a result we explain our contribution 
to the local scale selection considering the feature type. Experiments in scale and 
orientation validate the proposed approach. 

2 Pattern Description and Representation 

The appearance of an object is the composition of all images of the object obser- 
ved under different viewing conditions, illuminations, and object deformations. 
Adelson and Bergen |™t 1 define the appearance space of images for a given 
scene as a 7 dimensional local function, I(x, y, λ, t, V_x, V_y, V_z), whose
dimensions are viewing position, (V_x, V_y, V_z), time instant, (t), position,
(x, y), and wavelength, (λ). They call this function the “plenoptic function” from the Latin roots
plenus, full, and opticus, to see. The use of description techniques and the use of 
representation models of descriptors responses allow the analysis of the plenoptic 
function for recognition problems. 

Adelson and Bergen propose to detect local changes along one or two ple- 
noptic dimensions. The detector responses, that code the visual information, 
are represented by a table in which they are compared pairwise. Adelson and 
Bergen use low order derivative operators as 2-D receptive fields to analyse the 
plenoptic function. However, their technique is restricted to derivatives of order 
one and two. No analysis of three or more dimensions of the plenoptic function 
is investigated and little experimental work is published on this approach. 

Nevertheless the plenoptic function provides a powerful basis for recognition
systems. This paper deals with such a framework, where patterns are
characterized by describing their local visual information and modeling the
descriptor responses. The result is recognition software based on local
properties.

Consider the plenoptic function, I(x, y, V_x, V_y, V_z), constrained to a
single frame and a gray channel. I() is analyzed by a set of receptive fields.
An orthogonal basis of receptive field responses can be found that spans the
space of all receptive field responses. The goal is to decrease the dimensionality of this
space by determining those receptive fields which allow an optimal description 
of the appearance. This basis can vary according to the nature of the recognition 
problem. The next section discusses the construction of receptive fields accor- 
ding to different signal decomposition techniques. Then two methods of pattern 
representation in the feature space are discussed. The first is a statistical repre- 
sentation where objects are characterized as the joint statistics of the receptive 
field responses and, the second is a structural approach where connected struc- 
tures in the images are mapped as surfaces in the feature space. 
2.1 Signal Decomposition

Classically the description of a signal is obtained by its projection onto a
set of basis functions. Two widely used approaches for signal decomposition
are the Taylor expansion (equation (1)) and the Fourier transform (equation
(2)), corresponding respectively to the projection of the signal onto basis
functions with modulated amplitude and to the projection of the signal onto a
function base which is frequency modulated:

$$f(t) = \sum_{n=0}^{\infty} \frac{1}{n!} f^{(n)}(t_0) \, (t - t_0)^n \qquad (1)$$

$$f(t) = \sum_{n=-\infty}^{\infty} \hat{f}_n \, e^{i n \omega t} \qquad (2)$$

Note that there exist other local decomposition bases. The nature of the
problem motivates the choice of the decomposition base. For example, a
frequency-based analysis is more suitable for texture analysis, and a
fractal-based description for natural scene analysis. But independently of the
choice of basis, the receptive field responses are estimated over a
neighborhood whose size is relative to the locality of the analysis.

The derivative operator of the Taylor expansion and the spectral operator of
the Fourier transform can be formulated as generic operators. The concept of
linear neighborhood operators was redefined by Koenderink and van Doorn |K^
as generic neighborhood operators. Typically operators are required at
different scales corresponding to different sizes of estimation support.
Koenderink and van Doorn have motivated their method by rewriting neighborhood
operators as the product of an aperture function, A(p, σ), and a scale
equivariant function, φ(p/σ):

$$G(p) = A(p, \sigma) \, \phi(p/\sigma) \qquad (3)$$

The aperture function takes a local estimation at location p of the plenoptic
function which is a weighted average over a support proportional to its scale
parameter, σ. The Gaussian kernel satisfies the diffusion equation and can
therefore serve as aperture function:



A (p, a) 




(4) 



The function φ(p/σ) is a specific point operator relative to the decomposition
basis. In the case of the Taylor expansion, φ(p/σ) is given by the Hermite
polynomials:

$$\phi(p/\sigma) = (-1)^n \, He_n(p/\sigma) \qquad (5)$$

In the case of the Fourier series, φ(p/σ) are the complex frequency modulation
functions tuned to selected frequencies, ω.



^(p/a) = 



(6) 



Within the context of spatial, or spectral, signal decomposition the generic
neighborhood operators are scale normalized Gaussian derivatives [Lin98b] and,
respectively, scale normalized Gabor filters.
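The product of aperture function and point operator in equations (3) and (5) can be made concrete for a one-dimensional signal. The sketch below is an illustration under stated assumptions rather than the paper's code: it assumes a unit-area 1-D Gaussian as aperture function, and the function names `gaussian` and `normalized_gaussian_derivative` are hypothetical. It uses the probabilists' Hermite polynomials provided by NumPy.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def gaussian(x, sigma):
    """Assumed aperture function: unit-area 1-D Gaussian of scale sigma."""
    return np.exp(-x**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

def normalized_gaussian_derivative(x, sigma, order):
    """Aperture times point operator: sigma**order times the order-th Gaussian derivative."""
    coeffs = np.zeros(order + 1)
    coeffs[order] = 1.0  # selects the Hermite polynomial He_order
    point_operator = (-1.0) ** order * hermeval(x / sigma, coeffs)
    return gaussian(x, sigma) * point_operator

# Comparison with the closed forms of the first and second derivatives.
x = np.linspace(-5.0, 5.0, 101)
sigma = 1.7
first = normalized_gaussian_derivative(x, sigma, 1)
second = normalized_gaussian_derivative(x, sigma, 2)
first_closed_form = -(x / sigma) * gaussian(x, sigma)
second_closed_form = ((x / sigma) ** 2 - 1.0) * gaussian(x, sigma)
```

Because the point operator depends only on x/σ, the same code yields the operator at any scale by changing `sigma`, which is the property the following sections rely on.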
2.2 Pattern Representation in Feature Space

The computation of a vector of descriptors can be formally modeled as a
projection from the image pixel space to a new space more suitable for
indexing. This descriptor space is composed of N receptive fields
corresponding, for example, to a set of Gaussian derivatives or Gabor filters.
An image neighborhood, which is a vector in an M dimensional space, can be
represented in the descriptor space by an N dimensional vector with N ≪ M. The
distance between two points in the descriptor space is a measure for the
similarity of these neighborhoods and is used for recognition.

An object signature is obtained by representing or modeling the receptive
field responses in the feature space, either statistically or structurally.
Schiele has shown that the local appearance of static objects can be
represented statistically using multi-dimensional histograms of Gaussian
derivative responses. Histograms of object classes are compared and the
conditional probability that the observed vector is part of the trained
classes is returned. Colin de Verdiere [CC98a] has used a structural
representation by sampling the local appearance of static objects. Such
discrete sampling permits recognition from small neighborhoods by a process
which is equivalent to table lookup.

Statistical representation: The output from the set of receptive fields
provides a measurement vector at each pixel. The joint statistics of these
vectors allow the probabilistic recognition of objects. A multi-dimensional
histogram is computed from the output of the filter bank. These histograms can
be considered as object signatures and provide an estimate of the probability
density function that can be used with Bayes rule. Schiele [SC96] uses this
method for object recognition and Chomat extends the approach to the
recognition of activity patterns.
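The statistical representation can be illustrated with a small sketch. It is not Schiele's implementation: the two-dimensional response vectors are synthetic, and the helpers `response_histogram`, `bin_index` and `posterior`, the bin count and the value range are all hypothetical choices made for the illustration.

```python
import numpy as np

def response_histogram(responses, bins, value_range):
    """Normalised multi-dimensional histogram of receptive field responses."""
    hist, _ = np.histogramdd(responses, bins=bins, range=value_range)
    return hist / hist.sum()

def bin_index(vector, bins, value_range):
    """Histogram cell that contains a measurement vector (clamped to the range)."""
    index = []
    for value, (low, high) in zip(vector, value_range):
        i = int((value - low) / (high - low) * bins)
        index.append(min(max(i, 0), bins - 1))
    return tuple(index)

def posterior(vector, histograms, priors, bins, value_range):
    """Bayes rule over the objects, using the histograms as likelihoods."""
    cell = bin_index(vector, bins, value_range)
    likelihoods = np.array([h[cell] for h in histograms])
    joint = likelihoods * np.asarray(priors, dtype=float)
    total = joint.sum()
    return joint / total if total > 0 else joint

# Synthetic responses of two hypothetical objects in a 2-D descriptor space.
rng = np.random.default_rng(0)
bins = 8
value_range = [(-1.0, 1.0), (-1.0, 1.0)]
responses_a = rng.normal(-0.5, 0.1, size=(2000, 2))
responses_b = rng.normal(0.5, 0.1, size=(2000, 2))
histograms = [response_histogram(r, bins, value_range)
              for r in (responses_a, responses_b)]
probabilities = posterior((-0.45, -0.45), histograms, [0.5, 0.5],
                          bins, value_range)
```

In this toy setting a single measurement vector already separates the two objects; with real images the posterior would typically be accumulated over several neighborhoods.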

Structural approach: At each pixel a measurement vector of the output from
the set of receptive fields is stored in association with an identifier of the
model image and its position within the model image. The storage of vectors
associated with all model image points enables simultaneous identity and pose
recognition of objects by matching of measurement vectors. This matching is
performed efficiently by using a hierarchical data structure. Competitive
evaluation of multiple vectors of an image provides a highly discriminant
recognition of the learned object. As a result, this recognition scheme
returns one (or multiple) object poses
associated with a confidence factor based on the number of points detected on
this pose |GG98alGG99M 

In this paper a structural representation of data in a feature space is used
for the recognition of static object configurations. In the structural
approach the vectors with the shortest distance are searched. The class of the
observed vector is the class of the vector with the shortest distance. The
classification is based on searching all vectors that are within a sphere
centered on the observed vector. For an efficient performance of the
classification task, the storage structure is very important. Colin de
Verdiere [CC98a] proposes an indexation storage tree, in which each vectorial
dimension is decomposed into 4 parts successively. During the training phase
new vectors can be added easily to the tree. The advantage of this structure
is that all vectors within the search sphere centered on the observed vector
can be computed efficiently.
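The classification rule described above (search all stored vectors within a sphere around the observed vector, then take the class of the closest one) can be sketched as follows. This sketch deliberately replaces the indexation tree by an exhaustive search, which returns the same answer but without the efficiency of the tree; the function name `classify` and the toy data are hypothetical.

```python
import numpy as np

def classify(observed, model_vectors, model_labels, radius):
    """Label of the nearest stored vector inside the search sphere, or None."""
    distances = np.linalg.norm(model_vectors - observed, axis=1)
    inside = np.flatnonzero(distances <= radius)
    if inside.size == 0:
        return None  # no model vector is similar enough
    nearest = inside[np.argmin(distances[inside])]
    return model_labels[nearest]

# Toy model base: three stored descriptor vectors with their class labels.
model_vectors = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
model_labels = ["a", "b", "c"]

match = classify(np.array([0.9, 1.2]), model_vectors, model_labels, radius=1.0)
no_match = classify(np.array([10.0, 10.0]), model_vectors, model_labels, radius=1.0)
```

The cost of this exhaustive version grows linearly with the number of stored vectors, which is the motivation for a hierarchical storage structure.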

Note that the problem addressed in this paper is not a critical study of
indexing techniques, but the local scale parameter selection for Gaussian
based descriptors in the context of object recognition. The choice of a
structural full description of the appearance is therefore well suited to this
study, but other recognition schemes can benefit from this automatic scale
selection, for example a statistical representation by histograms
[CC99a, Sch97] or an interest points
based approach like Schmidt |S.\h:| who selects a priori interesting points and 
modelizes only these points in her system. 

2.3 Conclusion 

The quality of recognition techniques depends on their ability to recognize ob- 
jects in a scene under a minimum of assumptions. Generally the required pro- 
perties are their robustness or invariance to illumination and view point variati- 
ons. 

The robustness to illumination variations and point of view changes is
obtained by sampling the appearance of the object, that is, by including
images with these changes in the training base. The set of different
appearances of one object can be viewed as a trajectory in the appearance
space [MN95, CC98b] parameterized by illumination and view point. For example,
scale robustness is achieved by learning the object at several different
scales and matching a new image of the object to each of the trained scales.
Another approach to achieve robustness is to model the object at a fixed scale
and then match the images of a pyramid of a new observed image of the object.

As shown in this section, Gaussian based techniques are well suited for a
scale equivariant description since the scale parameter of the aperture
function and of the point operator is explicit. A scale invariant Gaussian
based description can be obtained by an appropriate scale parameter selection.
The problem of the detection of orientation is solved using the property of
steerability of Gaussian derivatives. Freeman and Adelson [FA91] use Gaussian
derivatives to compute the derivative under arbitrary orientation by a linear
combination of a finite number of derivatives.

The approach we propose is to use the properties of scale and orientation
parameterisation of Gaussian based local descriptors to design receptive
fields that are equivariant to scale and orientation. The use of Gaussian
derivatives as local descriptors provides an explicit specification of scale
and orientation. Scale is specified by the σ parameter, providing a scale
invariant feature. Using steerable filters [FA91], it is possible to compute
the n-th order derivative under arbitrary orientation by a linear combination
of a finite number of n-th order derivatives. In this paper a set of
normalized Gaussian derivatives up to order three is used to describe the
plenoptic function. A scale detector and an orientation detector are used to
normalize and steer Gaussian derivatives. The local description is equivariant
to scale and orientation and allows recognition which is invariant to scale
and orientation. Note that Gaussian derivatives can be efficiently computed
by using a recursive implementation (see |W^ 1. 
3 Scale Invariance

Theoretically there exist specific features that are scale invariant, such as
corners and edges. In practice this is not the case, because edges more
closely resemble a ramp over scales. However, these features can be described
using scale equivariant descriptors. The first paragraph of this section
demonstrates the scale equivariance property of Gaussian derivatives by
normalizing them according to the scale parameter. The next paragraph deals
with two scale invariant representations. One is based on a multi-scale data
representation (or pyramidal representation), and the other is based on local
scale parameter selection.

Scale equivariant receptive fields. A scale equivariant feature space is
designed using normalized Gaussian derivatives, taking into account their
scaling property:

$$\partial_{x^n} G(s \cdot x, s \cdot \sigma) = \frac{1}{s^{n+1}} \, \partial_{x^n} G(x, \sigma) \qquad (7)$$

Consider the Gaussian filter, G(x, σ), and the one dimensional signal, f(x).
Let L(x, σ) be the response of the Gaussian filter:

$$L(x, \sigma) = f(x) * G(x, \sigma) \qquad (8)$$

The normalization of the Gaussian derivative responses according to a selected
scale parameter, σ, is:

$$\partial_{\xi^n} L(\xi, \sigma) = \sigma^n \, \frac{\partial^n L(x, \sigma)}{\partial x^n} \quad \text{with} \quad \xi = \frac{x}{\sigma} \qquad (9)$$

This scale normalization leads to a descriptor which is scale equivariant:

$$\partial_{\xi'^n} L(\xi', \sigma') = \partial_{\xi^n} L(\xi, \sigma) \quad \text{with} \quad x' = s \cdot x \ \text{and} \ \sigma' = s \cdot \sigma \qquad (10)$$

A local scale equivariant feature space can be built using such scale
normalized descriptors, but a priori knowledge of the local feature scale is
necessary.
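The scale equivariance of normalized Gaussian derivatives can be verified numerically on a one-dimensional signal. The sketch below is only an illustration under assumptions: the test pattern, the grid step, the scale factor s = 2 (chosen as an integer so that corresponding points fall on grid samples) and the function names are hypothetical, and the convolution is a plain Riemann sum with the kernel truncated at six standard deviations.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def gaussian_derivative_kernel(sigma, order, step):
    """Sampled order-th derivative of a unit-area 1-D Gaussian, times the grid step."""
    half = int(np.ceil(6.0 * sigma / step))
    t = np.arange(-half, half + 1) * step
    gauss = np.exp(-t**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)
    coeffs = np.zeros(order + 1)
    coeffs[order] = 1.0
    return (-1.0 / sigma) ** order * hermeval(t / sigma, coeffs) * gauss * step

def normalized_response(signal, sigma, order, step):
    """sigma**order times the order-th derivative of the smoothed signal."""
    kernel = gaussian_derivative_kernel(sigma, order, step)
    return sigma**order * np.convolve(signal, kernel, mode="same")

def pattern(u):
    """Hypothetical smooth test signal that vanishes far from the origin."""
    return np.exp(-u**2 / 8.0) * np.cos(u)

step = 0.02
x = np.arange(-2000, 2001) * step
s = 2.0                      # scale factor between the two signals
sigma = 1.5
f = pattern(x)               # original signal f(x)
f_scaled = pattern(x / s)    # the same pattern observed at twice the size

centre = 2000
idx = np.arange(-1000, 1001)  # sample k of f corresponds to sample 2k of f_scaled
max_gap = {}
for order in (1, 2):
    r = normalized_response(f, sigma, order, step)
    r_scaled = normalized_response(f_scaled, s * sigma, order, step)
    max_gap[order] = float(np.max(np.abs(r[centre + idx]
                                         - r_scaled[centre + 2 * idx])))
```

Without the factor `sigma**order` the two responses would differ by a factor of s to the power of the derivative order, which is why the normalization is needed before responses at different scales can be compared.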

Scale invariant modeling. Traditionally, the scale parameter of the filters is
defined intuitively, according to the size of the features to be recognized,
and a multi-scale strategy is adopted to overcome the problem of scale
variations. Models of objects are built at several scales and matching is done
by comparison within the different trained scales. Currently a similar
strategy is adopted to be robust to changes in orientation by learning several
orientations. The goal of such strategies is to become robust to scale
changes. This robustness relies on the structure used for data representation
or modeling. The main problem remains the parameterization of the receptive
fields: generally the scale parameter is fixed but not appropriate to the
local information.

The approach we propose is to locally estimate the required scale at each
point and to normalise the Gaussian derivative filters to the local scale.
With such a method there is no data representation at several scales (which is
redundant), but only one single scale invariant data representation. Several
maps of selected scales are used depending on the features to be analysed.
4 Detection of Orientation

Structured neighborhoods can have a dominant orientation. The dominant di- 
rection of a neighborhood can be found by determining the filter direction that 
gives the strongest response. 

There are two ways to determine this filter. First, a set of filters can be
generated, each rotated by a small angle with respect to the previous one.
Then each filter of this set is applied to the neighborhood. If a precise
orientation is required, the number of generated filters is very high, and so
is the computation cost of the operation.

A second possibility is to use only a small number of appropriate filters and
interpolate between the responses. With an appropriate filter set and the
correct interpolation rule, the response of the neighborhood to a filter with
an arbitrary orientation can be determined without explicitly applying this
filter to the neighborhood. Freeman [FA91] uses the term steerable filter for
such a filter class.

Steerable Filters. Let G_n be the n-th order derivative of the Gaussian
function. Let (·)^θ be the rotation operator, so that f(x, y)^θ is the
function f(x, y) rotated by θ. The synthesized filter of direction θ can be
obtained by a linear combination of G_1^{0°} and G_1^{90°} [FA91]:

$$G_1^{\theta} = \cos(\theta) \, G_1^{0^\circ} + \sin(\theta) \, G_1^{90^\circ} \qquad (11)$$
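The steering relation of equation (11) can be checked on sampled kernels. The following sketch assumes a unit-volume 2-D Gaussian sampled on an integer grid; the function names `first_derivative_basis` and `steer`, the kernel size and the angle are hypothetical choices for the illustration.

```python
import numpy as np

def first_derivative_basis(half, sigma):
    """Sampled first Gaussian derivatives along x (0 degrees) and y (90 degrees)."""
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    gauss = np.exp(-(x**2 + y**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    g0 = -x / sigma**2 * gauss
    g90 = -y / sigma**2 * gauss
    return x, y, gauss, g0, g90

def steer(g0, g90, theta):
    """First derivative filter of direction theta as a combination of the basis."""
    return np.cos(theta) * g0 + np.sin(theta) * g90

sigma = 2.0
x, y, gauss, g0, g90 = first_derivative_basis(half=12, sigma=sigma)
theta = np.deg2rad(30.0)
steered = steer(g0, g90, theta)

# Direct evaluation of the derivative along the direction (cos theta, sin theta).
u = x * np.cos(theta) + y * np.sin(theta)
direct = -u / sigma**2 * gauss
max_gap = float(np.max(np.abs(steered - direct)))
```

Since filtering is linear, the same combination can be applied to the two filter responses instead of the kernels, so only two convolutions are needed for any orientation.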

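Equivariance of Orientation. Let I be an image. In I let w be the neighborhood
around p with the dominant orientation θ. Let I^ω be I rotated by ω. The
neighborhood w' in I^ω corresponding to w then has the dominant direction
θ + ω. (G_1^{0°})^ω and (G_1^{90°})^ω are the basis functions G_1^{0°} and
G_1^{90°} rotated by ω and can be written as

$$(G_1^{0^\circ})^{\omega} = \cos(\omega) \, G_1^{0^\circ} + \sin(\omega) \, G_1^{90^\circ} = G_1^{0^\circ + \omega} \qquad (12)$$

$$(G_1^{90^\circ})^{\omega} = -\sin(\omega) \, G_1^{0^\circ} + \cos(\omega) \, G_1^{90^\circ} = G_1^{90^\circ + \omega} \qquad (13)$$

The equivariance results from

$$(G_1^{\theta})^{\omega} = \cos(\theta) \, (G_1^{0^\circ})^{\omega} + \sin(\theta) \, (G_1^{90^\circ})^{\omega} \qquad (14)$$

$$= \cos(\theta + \omega) \, G_1^{0^\circ} + \sin(\theta + \omega) \, G_1^{90^\circ} = G_1^{\theta + \omega}$$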

Orientation Invariance. Taking into account the values of the gradient G_1 in
the x and y directions, the dominant direction of a neighborhood can be
determined by

$$\theta = \operatorname{atan2}\!\left( \frac{\partial}{\partial y} L(x, y; \sigma), \ \frac{\partial}{\partial x} L(x, y; \sigma) \right) \qquad (15)$$

For each neighborhood in the image the dominant direction can be determined,
which allows each neighborhood to be normalised in orientation. Using the
equivariance property, two corresponding neighborhoods w and w' will be
normalized to the same neighborhood.
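The dominant direction can be computed from the two first-derivative responses at the centre of a neighborhood. The sketch below is a hypothetical illustration: the patch is a synthetic intensity ramp of known orientation, the argument order atan2(y-derivative, x-derivative) is the usual convention and is assumed here, and angles are measured in array coordinates (x along columns, y along rows).

```python
import numpy as np

def gaussian_gradient_kernels(half, sigma):
    """Sampled first Gaussian derivatives along x and along y."""
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    gauss = np.exp(-(x**2 + y**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    return -x / sigma**2 * gauss, -y / sigma**2 * gauss

def dominant_orientation(patch, sigma):
    """Gradient direction of the smoothed patch, evaluated at the patch centre."""
    half = patch.shape[0] // 2
    kx, ky = gaussian_gradient_kernels(half, sigma)
    # Convolution at the centre: flip the kernel, multiply and sum.
    lx = np.sum(patch * kx[::-1, ::-1])
    ly = np.sum(patch * ky[::-1, ::-1])
    return np.arctan2(ly, lx)

# Synthetic neighborhood: an intensity ramp increasing along direction alpha.
half = 20
y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
alpha = 0.7
patch = x * np.cos(alpha) + y * np.sin(alpha)
theta = dominant_orientation(patch, sigma=3.0)
```

Once `theta` is known, the receptive fields can be steered to it, so that two neighborhoods differing only by a rotation yield the same descriptor.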
5 Local Scale Selection

Features in a scene appear in different ways depending upon the scale of observa- 
tion. Traditionally, when scale is considered, image structures are represented in 
a multi-scale pyramid and processing is applied to a set of scales. Such techniques 
are sensitive to the fact that some features may disappear at too coarse scales 
or too fine scales. It is therefore necessary to determine an appropriate scale for 
each observed feature. Targeting this appropriate scale for the projection in the 
feature scale in association with scale invariant features computation enables a 
scale independent representation. 

Lindeberg [Lin98b] proposes a framework for generating hypotheses about scale
levels based on the assumption that local extrema over scales of normalized
derivatives correspond to interesting structures. This approach gives rise to
analytically derived results which correspond to intuition for scale selection
for detecting image features.

This section provides experiments based on Lindeberg's proposal for feature
scale selection. We are interested in receptive fields which are scale
invariant and orientation invariant. Scale invariance is obtained by selecting
a scale relative to the feature shape. Such scale selection is available using
Laplacian based operators. The detection of orientation is done at an
appropriate scale where the gradient is stable.

5.1 Blob Features Scale Selection 

The proposed general scale detector of equation (16) is expressed as a
polynomial combination of normalized Gaussian derivatives, where the
normalization controls the scale invariance.

$$Lap(x, y, \sigma_s) = \sigma_s^2 \left( \partial_{xx} g(x, y, \sigma_s) + \partial_{yy} g(x, y, \sigma_s) \right) \qquad (16)$$

The function Lap(x, y, σ_s) is computed for a large set of scales, σ_s,
corresponding to the scale-space feature signature. The maximum of the
normalized derivatives, Lap^{max}(x, y, σ_s = σ_0), along scales leads to the
feature scale σ_0.
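The selection of the feature scale as the maximum of the normalized Laplacian over scales can be sketched on a synthetic blob. This is an illustration under assumptions, not the experimental code of the paper: the blob is a Gaussian of known width, only the response at the blob centre is evaluated, the scale sampling is a hypothetical choice, and the magnitude of the response is maximized because a bright blob gives a negative Laplacian.

```python
import numpy as np

def normalized_laplacian_at_centre(image, sigma):
    """sigma**2 times the Laplacian of the Gaussian-smoothed image at its centre."""
    half = image.shape[0] // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    r2 = x**2 + y**2
    gauss = np.exp(-r2 / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    laplacian_kernel = (r2 / sigma**4 - 2.0 / sigma**2) * gauss
    # The kernel is symmetric, so the convolution at the centre is a plain sum.
    return sigma**2 * np.sum(image * laplacian_kernel)

def select_scale(image, scales):
    """Scale with the strongest normalized Laplacian, and the full signature."""
    responses = np.array([normalized_laplacian_at_centre(image, s)
                          for s in scales])
    return scales[int(np.argmax(np.abs(responses)))], responses

# Synthetic blob feature: a Gaussian bump of standard deviation blob_sigma.
half = 50
y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
blob_sigma = 4.0
image = np.exp(-(x**2 + y**2) / (2.0 * blob_sigma**2))

scales = np.arange(1.0, 8.25, 0.25)
selected_scale, signature = select_scale(image, scales)
```

For such a Gaussian blob the magnitude of the signature peaks where the filter scale equals the blob width, so doubling the blob size doubles the selected scale; this is the behaviour exploited for matching features between images of different scales.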

The equivariance property of blob feature scale, which enables blob feature
recognition at different scales, is demonstrated in the following experiment.
A set of images representing a scene at several scales is taken (two of them
are shown in figure 1). A target is tracked along scales and its scale
signature is shown in the central figure. The over-lined feature has a
signature which translates with scale. These curves present a maximum over
scale. The σ parameter, which is characteristic of the local feature, is
selected according to the observed maximum. Thus, the local scale can be used
to parameterize the normalized Gaussian derivatives described in the previous
section. Maps of the selected scale parameter σ of two images of the chocos
object are shown in figure 2. These maps show that the scale parameter
distribution is preserved over the scanned scale range. Figure 3 shows the
first normalized derivative along the x axis computed on three “chocos”
images. For all images the derivatives up to order three are computed. Each
local feature is described by a nine dimensional receptive field, which
corresponds to a point in a nine dimensional descriptor space. This vector is
scale invariant and provides a means to obtain point correspondences between
images at different scales. In the first image, four points have been
selected, which correspond to four feature vectors. In the next images, their
correspondents are successively detected by searching for the most similar
feature vectors.

Fig. 1. Automatic scale selection of a local feature for correspondence between
two images. The curves present the evolution of the normalized Laplacian with
σ. Circles indicate a radius of 2σ (twice the selected scale parameter value).
The ratio between the selected σ gives the scale ratio between the local
features and therefore, in this example, an approximate scale ratio of 2
between the images.

Fig. 2. Images of the selected scale parameter σ of two images of the chocos
object.

5.2 Edge Features Scale Selection 

In order to design a receptive field that is invariant to orientation, the dominant
direction of an image neighborhood needs to be determined. An important parameter
for a reliable orientation normalization is the selected scale parameter.

Fig. 3. Point correspondences obtained by projecting the points onto a nine dimensional
feature space and matching similar vectors between the images. The gray values in the
images correspond to the intensity of the first derivative.



If this parameter is not appropriate, the gradient information becomes unstable.
This results in an orientation error which makes classification difficult.

The neighborhood size is chosen such that the gradient in at least one direction
is stable. An appropriate measure is the gradient norm. It is isotropic
and returns a maximum energy when a stable gradient is present. If none of
the gradient filters is stable within a maximum filter size, the neighborhood
contains very low energy edge data. Orientation normalisation is then unstable,
but because of the lack of edge data, this does not perturb the recognition.

In a previous experiment one single scale based on the Laplacian was selected
for all derivatives. This scale is appropriate for blob features. Figure 4 compares
the σ detected by the normalized gradient norm and the σ detected by the
normalized Laplacian. The graphs show that different σ are detected by the
gradient norm and the scale normalized Laplacian.

The graph below the image in figure 4 shows an interesting case. If the σ
selected by the normalized Laplacian were applied for orientation detection,
orientation errors could not be avoided. The scale where the normalized Laplacian
has maximum energy shows a normalized gradient norm with very weak energy.
The σ detected by the normalized Laplacian is appropriate for the purple blob
between the two white lines, which is a uniform region. For a stable orientation
detection a much bigger size must be selected to obtain enough edge information.
This is an extreme case. More often the normalized Laplacian selects a size where
there is some gradient energy (see the graphs left and right of the image in figure
4). However, a higher gradient energy is found at a different scale.



5.3 Scale and Orientation Invariant Description 

Lindeberg uses normalized derivatives for adaptively choosing the scales for blob,
corner, edge or ridge detection. The selected scale is also a good cue
for tuning the parameters of Gaussian derivatives for appearance description.
Tuning Gaussian derivatives with the map of selected scales leads to a scale

Fig. 4. Comparison of σ detection using the normalized gradient norm or the
normalized Laplacian.



invariant description. Steering the filters to the dominant orientation leads
to an orientation invariant description.

The scale is an important parameter and must therefore be chosen very carefully.
The scales detected by the Laplacian adapt very well to blob features, whereas
the scale detected by the gradient norm defines a neighborhood such that a
stable gradient can be found within this neighborhood.
Both features are important for a reliable recognition and essential for scale and
orientation invariant description.

Figure 4 displays the energy differences of gradient norm energy and Laplacian
energy over different scales. It can be observed that the two filter methods
detect different scales in all cases. This is due to the fact that the filter methods
adapt to either blob features or edge features. The presence of a blob feature and
an edge feature of the same scale in the same neighborhood is a contradiction.
As a consequence we investigate in the following experiments the impact of using
both scales selected by the two methods in order to improve the stability in scale
and orientation normalisation.

Figure 5 displays this enhancement on an example. The orientation displayed
in the angle image in figure 5(d), obtained from the gradient norm scales, is much
more stable than the angle image in figure 5(c), obtained from the scales selected
by the Laplacian. Many discontinuities can be observed in figure 5(c) that are
not present in figure 5(d). This improvement in the orientation detection is due
to the fact that the scale is based on derivatives of order one.

To obtain a description which is both scale and orientation invariant, two
scale maps are used. The map of selected scales obtained with the gradient
norm is used to parameterize odd derivatives and the map of selected scales

Fig. 5. (a) sigma image obtained from Laplacian (b) sigma image obtained from
gradient norm (c) angle image resulting from sigma selected by Laplacian (d) angle image
resulting from sigma selected by gradient norm.



obtained with the Laplacian is used to parameterize even derivatives. The next 
section provides recognition results using such a scale and orientation invariant 
description. 

6 Application to Indexation 

6.1 Experiments on Scale and Orientation Detection 

For validation of the presented approach, automatic scale selection is applied
to indexation. Two experiments are compared to show the stabilization of the
orientation normalisation using gradient norm for scale selection. In the first
experiment the scale is selected by the Laplacian. In the second experiment two
scales are selected, one based on the gradient norm for the first derivatives and a
second based on the Laplacian for the second derivatives.

A set of 13 images is taken of one single object. The object is rotated by
15 degrees between consecutive frames. One image is used for training. The other 12
images are used for testing.

The training is performed using local appearance description by Gaussian
derivatives, recursive filtering, automatic scale selection
and orientation normalisation by steerable filters. At each point of the training
image the most appropriate scale is detected using the two different strategies
in the two experiments. Then the neighborhoods are normalized by orientation
and an 8 dimensional filter response to the steered Gaussian derivatives is computed,
which is stored in an indexation storage tree. The filter response serves
to identify the neighborhood and its similar samples in the test images.

Overlapping neighborhoods are sampled from the test images with a step size
of 3 pixels. At each point the orientation normalized filter response at the appropriate
scale is computed. To evaluate the experiment only the training vector
with the smallest distance is considered. In general the approach returns all vectors
from the indexation tree that are within a sphere centered on the newly
observed vector. It is a restriction to look only at the closest vector. The recognition
rates are naturally lower than in a system which takes into account
the entire list of hypotheses. Two values are computed. First, the percentage
of cases in which the closest vector is correct is measured. This means that the answer
obtained from the indexation tree indicates the correct neighborhood. Second, the average
error in the orientation normalisation is measured. This value indicates the
precision of the orientation normalisation.
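
The lookup and the first evaluation measure can be sketched as follows. This is a minimal illustration under stated assumptions: a linear scan stands in for the indexation storage tree, the distance is taken to be Euclidean, and the names `query` and `first_answer_rate` as well as the toy data are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def query(index, vector, radius):
    """Return (label, distance) pairs inside the sphere around vector, closest first.

    index is a list of (label, stored_vector) pairs; a linear scan replaces
    the indexation tree in this sketch.
    """
    hits = [(label, euclidean(stored, vector)) for label, stored in index]
    hits = [h for h in hits if h[1] <= radius]
    hits.sort(key=lambda h: h[1])
    return hits

def first_answer_rate(index, queries, radius):
    """Fraction of queries whose closest hypothesis carries the expected label."""
    correct = 0
    for vector, expected in queries:
        hits = query(index, vector, radius)
        if hits and hits[0][0] == expected:
            correct += 1
    return correct / len(queries)

# Toy data: three stored neighborhoods and two observed vectors.
index = [("a", [0.0, 0.0]), ("b", [1.0, 0.0]), ("c", [5.0, 5.0])]
hypotheses = query(index, [0.1, 0.0], 2.0)
rate = first_answer_rate(index, [([0.1, 0.0], "a"), ([0.9, 0.0], "a")], 2.0)
```

Only the head of the hypothesis list enters `first_answer_rate`, which mirrors the restriction to the closest vector described above.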




Fig. 6. Left: percentage of correct first answers. The image at 0 degrees is the training
image. Right: average orientation error observed during the indexation process.



Figure 6 shows the results of the two experiments. The solid curve corresponds
to the experiment in which only one scale was selected. The dashed
curve corresponds to the experiment in which different scales were selected for
first and second derivatives. The graphs show that the second technique produces
a higher percentage of correct first answers and a higher precision in the
orientation normalisation. These two values illustrate the gain that is obtained
by using the gradient norm for the scale selection of the first derivative.



6.2 Object Recognition under Scale Variations 

This object recognition experiment is evaluated on a basis of 28 objects (figure
7). One single image is learned. The results are shown on examples for objects
“Chocos” and “Robot”, which are highlighted in figure 7. Figures 8 and 9 show the
recognition rates based on the receptive field responses. The first column displays
the results for object “Chocos” and the second column shows the results for
object “Robot”. For each object two graphs are presented. The first one shows
recognition rates with a fixed scale parameter, and the second one represents
the recognition rate with automatic scale selection. The algorithm returns a list
of hypotheses ordered by increasing distance to the observed receptive field

Fig. 7. Set of images of objects seen at different scales.



response. The three curves in the graphs are computed taking into account this
hypothesis list and correspond to three recognition cases:

— [a] The object corresponding to the answer with the smallest distance is
correct.

— [b] The correct object is among the list of hypotheses, but other objects have
a smaller distance.

— [c] Percentage of accepted neighborhoods. The description vector is rejected
due to missing discrimination. The list of hypotheses is either empty or too
large to be processed.

The percentage of accepted neighborhoods is very low for recognition in the
case of a fixed scale parameter. Some neighborhoods have a quasi constant grey
level, which leads to a very long list of hypotheses. These neighborhoods are ambiguous
and not suitable for recognition. The automatic scale selection increases
the percentage of accepted neighborhoods, because the scale is adapted to the
feature. Figure 8 shows that the recognition rate is unsatisfactory for scale variations
above 20%, whereas in figure 9 the recognition rate remains above 50%.
Recognition is possible using a voting or a prediction-verification algorithm.



7 Conclusion 

The appearance of features depends upon the scale of observation. In order to
capture a maximum number of features, the scale of observation needs to be
variable. This paper has shown very promising results of recognition under variable
scale. Lindeberg proposes automatic scale selection for the determination
of the appropriate feature scale. He assumes that maxima over scales of normalized
derivatives reflect the scale of patterns. The selected scale corresponds to
the Gaussian scale parameter at which the inner product of the derivative operator
and the local image signal gives the strongest response. The application
of this approach to all image neighborhoods allows the recognition of objects at

Fig. 8. Recognition rate of objects seen at different scales. The abscissa of the graphs is the
scale ratio between the analysed image and the model image. The left graphs deal with the
object “Chocos” and the right ones deal with the “Robot” object. Recognition is done
with a fixed scale parameter.

Fig. 9. Recognition rate of objects seen at different scales. The abscissa of the graphs is the
scale ratio between the analysed image and the model image. The left graphs deal with the
object “Chocos” and the right ones deal with the “Robot” object. Recognition is done
with automatic scale selection.



different scales. A map of appropriate scales is obtained that can be used to normalize
the receptive fields. With steerable filters the dominant orientation of a
neighborhood can be detected, which results in orientation invariance. Scale and
orientation invariance are achieved by normalizing local descriptors. A remarkable
gain in recognition and in the precision of the orientation normalization is
achieved compared to the approach in which the feature type is ignored.

These results demonstrate that the approach is promising for real-world applications.
The precise performance of the approach still requires a theoretical
and quantitative evaluation, taking into account that the scale selection fails in
some cases. Problems remain with point correspondences between images
in which the objects undergo large scale variations (factor 5 and more).
Recognition requires that local patterns remain in a valid scale range between
images. This range has to be evaluated, and then the scale invariant recognition
scheme must be applied to a set of images featuring high scale variations (larger
than three). The proposed approach has been tested with good results for image
pairs with scale factor up to three.

Another interesting case is the detection of several characteristic scales for
one feature. In this case several local extrema are present in the normalized
Laplacian curves as a function of the parameter σ. Considering all detected scales of a
feature leads to a description that preserves a higher amount of information,
which can result in a superior recognition system.

Abstract. This paper offers a novel detection method, which works
well even in the case of a complicated image collection - for instance,
a frontal face under a large class of linear transformations. It was also
successfully applied to detect 3D objects under different views. Call the
class of images, which should be detected, a multi-template.

The detection problem is solved by sequentially applying very simple
filters (or detectors), which are designed to yield small results on the
multi-template (hence “anti-faces”), and large results on “random” natural
images. This is achieved by making use of a simple probabilistic
assumption on the distribution of natural images, which is borne out
well in practice, and by using a simple implicit representation of the
multi-template.

Only images which passed the threshold test imposed by the first detector
are examined by the second detector, etc. The detectors have the
added bonus that they act independently, so that their false alarms are
uncorrelated; this results in a percentage of false alarms which decreases
exponentially in the number of detectors. This, in turn, leads to a very
fast detection algorithm, usually requiring (1 + δ)N operations to classify
an N-pixel image, where δ < 0.5. Also, the algorithm requires no
training loop.

The suggested algorithm’s performance compares favorably to the well-known
eigenface and support vector machine based algorithms, and it is
substantially faster.

1 Introduction

The well-known template detection problem in computer vision is: given a (usually
small) image - the template - T, and a (usually much larger) image P,
determine if there are instances of T in P, and if so, where. A typical scenario
is: given a photograph of a face, and a large image, determine if the face appears
in the image.

This problem may be solved by various methods, such as cross-correlation,
or Fourier-based techniques. A more challenging problem is what we
call multi-template detection. Here, we are given not one template T, but a class
of templates T (which we call a multi-template), and are required to answer
the more general question: given a large image P, locate all instances of any
member of T within P. Obviously, if T can be well represented by m templates,
we could apply the standard template detection techniques m times, and take the
union of the results. This naive approach, however, breaks down in complexity
for large m. The goal of this research is to develop an efficient algorithm for
multi-template detection.

Typical cases of interest are: 

— Given an image, locate all instances of human faces in it. 

— Given an aerial photograph of an airfield, locate all instances of an airplane 
of a given type in it. If we do not know the angle at which the airplanes are 
parked, or the position from which the photograph was taken, then we have 
to locate not a fixed image of the airplane, but some affinely distorted version 
of it. If the photograph was taken from a relatively low altitude, we may have 
to look for perspective distortions as well. In this case, the multi-template 
consists of a collection of affinely (perspectively) distorted versions of the 
airplane, and it can be well-approximated by a finite collection of distorted 
versions, sampled closely enough in transformation space (obviously, one 
will have to limit the range of distortions; say, allow scale changes only at a 
certain range, etc.). 

— Locate different views of a three-dimensional object in a given image. 



1.1 Structure of the Paper 

We proceed to define some relevant concepts, and outline the idea behind the detection
scheme offered in this work. After surveying some related research, we lay
the mathematical foundation for the anti-face algorithm. Following that, some
experimental results are presented, and compared with eigenface and support
vector machine based methods.

1.2 “Quick Rejection” vs. Detection and Recognition 

The detection/recognition problem has a few stages, which converge to the solution.
We term them as follows:

— The quick rejection stage: here, one tries to find a fast algorithm that filters
out most input images that are not in the multi-template T; usually, since
we are searching a large image, the majority of input images - which, in this
case, are the sub-images of the large image - will not be in T. The quick
rejection stage has to fulfill three requirements:

1. It should be fast, due to the large number of input images.

2. It should not classify a member of T as a non-member.

3. It should classify as few non-members as possible as members.

It is well-known, for instance from the extensive research on the eigenface
method, that, in some cases, T can be reasonably approximated by a
linear subspace F, whose dimension k is considerably smaller than the dimension
of the ambient Euclidean space in which T’s images reside. In the eigenface
method, the quick rejection stage consists of casting aside the images whose
distance from F is larger than a certain threshold, and it takes O(k) convolutions
to compute this distance. However, in some cases k turns out to
be quite large, as will be demonstrated in the sequel.

— The detection stage: here, one has to filter out the errors of the quick rejection 
stage - that is, detect the non-members of T which were not screened out 
in the quick rejection stage. 

— The recognition stage: this optional stage consists of identifying and labeling
T’s members. For instance, after human faces have been detected in an
image, one may wish to identify the corresponding individuals. However, 
there are cases in which recognition is not important or relevant; for instance, 
in detecting airplanes of a specific type, which are parked in an airfield, 
the recognition problem may be meaningless, since there is no distinction 
between airplanes parked at different angles. 

This work addresses only the quick rejection stage. However, our empirical
results were usually good enough to make the detection stage superfluous.
Therefore we shall hereafter allow a slight abuse of terminology, by referring to the
algorithm proposed here as a detection algorithm, and to the filters developed
as detectors.



1.3 A Short Description of the Motivation Behind the Anti-Face 
Algorithm 

The idea underlying the detection algorithm suggested here is very straightforward,
and makes use of the fact that often an implicit representation is far more
appropriate than an explicit one, for determining whether a certain element
belongs to a given set.

Suppose, for instance, that we wish to determine whether a point in the plane,
(x, y), belongs to the unit circle. Naturally, we will simply compute the value of
$x^2 + y^2 - 1$; that is, we make use of the fact that there exists a simple functional,
which assumes a value of zero on the set - and only on it. We could also use
this fact to test whether a point is close to the unit circle. Let us term the idea
of determining membership in a set S, by using functionals which obtain a very
small value on S, as the implicit approach, and the functionals will be called
separating functionals for S. Note that this definition is more general than the
classic “separating hyperplane” between classes; we are not trying to separate the
set from its complement, or from some other set, by a hyperplane, or a quadratic
surface etc., but characterize it by the range of values that certain functionals
obtain on it. In general, this characterizes the set accepted by the detection
process as semi-algebraic (if we restrict our functionals to be polynomials), or
some other set, which is defined in the most general manner by:

$$A = \bigcap_{i=1}^{m} \{\, x : |f_i(x)| < \epsilon_i \,\} \qquad (1)$$

where $f_i$ are the separating functionals, and the $\epsilon_i$ are small. For the unit circle,
for instance, one separating functional is required: $x^2 + y^2 - 1$.

So, to test whether a point (or, in our case, an image) x belongs to S, one
has to verify that, for every $1 \le i \le m$, $|f_i(x)| < \epsilon_i$. The decision process can
be shortened by first checking the condition for $f_1$, and applying $f_2$ only to the
points (images) for which $|f_1(x)| < \epsilon_1$, etc.
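
The sequential test just described can be sketched in a few lines. The function name `in_set`, the tolerance and the sample points are assumptions made for illustration; only the unit circle functional comes from the text.

```python
def in_set(x, functionals, epsilons):
    """Check |f_i(x)| < eps_i for each i in turn, stopping at the first failure."""
    for f, eps in zip(functionals, epsilons):
        if abs(f(x)) >= eps:
            return False  # later functionals are never evaluated for x
    return True

def unit_circle(p):
    """Separating functional for the unit circle: x^2 + y^2 - 1."""
    x, y = p
    return x * x + y * y - 1.0

# Hypothetical sample points; the first two lie on the unit circle.
points = [(1.0, 0.0), (0.6, 0.8), (0.5, 0.5), (2.0, 0.0)]
near_circle = [p for p in points if in_set(p, [unit_circle], [1e-6])]
```

Because the loop returns at the first violated condition, the cost per point depends on how many functionals it survives, which is what makes an early, cheap rejection attractive.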

This very general scheme offers an attractive algorithm for detecting S, if
the following conditions hold:

— $A \supseteq S$, where A is the set accepted by the detection process. This is
crucial, as S should be detected.

— m is small.

— The separating functionals $f_i$ are easy to compute.

— If $y \notin S$, there is a small probability that $|f_i(y)| < \epsilon_i$ for every i.

Now, suppose one wishes to extend the implicit approach to the problem of
quick rejection for a multi-template T. Let us from here on replace “separating
functional” by the more intuitive term detector.

Images are large; it is therefore preferable to use simple detectors. Let us
consider then detectors which are linear, and act as inner products with a given
image (viewed as a vector). For this to make sense, we have to normalize the
detectors, so assume that they are of unit length. If $|\langle d, t \rangle|$ is very small for
every $t \in T$, then $f(y) = |\langle d, y \rangle|$ is a candidate for a separating functional for T.
However, if we just choose a few such “random” $d_i$, this naive approach fails, as
$|\langle d_i, y \rangle|$ is very small also for many images y which are not close to any member
of T.

Let us demonstrate this by an example. The object that has to be detected is
a pocket calculator, photographed at an unknown pose, from an unknown angle,
and from a range of distances which induces a possible scaling factor of about
0.7 to 1.3 independently along both axes. Thus, T consists of many projectively
distorted images of the pocket calculator. Proceeding in a simplistic manner, we
may try to use as detectors a few unit vectors, whose inner product with every
member of T is small; they are easy to find, using a standard SVD
of T’s scatter matrix, and choosing the eigenvectors with the smallest
eigenvalues. In Figs. 1-2 we show the result of this simplistic algorithm, which,
not surprisingly, fails:










Fig. 1. Two of the members of the pocket calculator multi-template, and three of the
“simplistic” detectors.

Fig. 2. The failure of the “simplistic” detectors, depicted in Fig. 1, to correctly locate
the pocket calculator. Detection is marked by a small bright square at the upper left
corner of the detected image region. Not only are there many false alarms, but the
correct location is not detected. In the sequel, it will be shown that very accurate
detection can be achieved by using better detectors.



Figs. 1-2 demonstrate that it is not enough for the detectors to yield small
values on the multi-template T; while this is satisfied by the detectors depicted
in Fig. 1, the detection results are very bad. Not only are many false alarms
present, but the correct location is missed, due to noise and the instability of
the detectors. More specifically, the detection fails because the detectors also
yield very small results on many sub-images which are not members of T (nor
close to any of its members). Thus, the detectors have to be modified so that they
will not only yield small results on T’s images, but large results on “random”
natural images.

To the rescue comes the following probabilistic observation. Most natural
images are smooth. As we will formally prove and quantify in the sequel, the
absolute value of the inner product of two smooth vectors is large. If d is a
candidate for a detector to the multi-template T, suppose that not only is $|\langle d, t \rangle|$
small for $t \in T$, but also that d is smooth. Then, if $y \notin T$, there is a high
probability that $|\langle d, y \rangle|$ will be large; this allows us to reject y, that is, determine
that it is not a member of T.

In the spirit of the prevailing terminology, we call such vectors d “anti-faces”
(this does not mean that detection is restricted to human faces). Thus, a candidate
image y will be rejected if, for some anti-face d, $|\langle d, y \rangle|$ is larger than
some d-specific threshold. This is a very simple process, which can be quickly
implemented by a rather small number of inner products. Since the candidate
image has to satisfy the conditions imposed by all the detectors, it is enough to
apply the second detector only to images which passed the first detector test,
etc; in all cases tested, this resulted in fewer than 1.5N
operations for an N-pixel candidate image. In the typical case in which all the
sub-images of a large image have to be tested, the first detector can be applied
by convolution.
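
A rough estimate shows how such an operation count can arise. The symbols $p_i$ and $p$ are assumptions introduced for this illustration and do not appear in the original text: suppose one inner product with an N-pixel candidate costs N operations and the i-th detector passes a fraction $p_i$ of the candidates that reach it.

```latex
\text{expected cost per candidate}
  = N \left( 1 + p_1 + p_1 p_2 + \dots + p_1 p_2 \cdots p_{m-1} \right)
  \le N \sum_{j=0}^{m-1} p^{\,j}
  < \frac{N}{1 - p},
\qquad p = \max_i p_i < 1 .
```

Under these assumptions, pass rates of at most $p = 1/3$ would keep the expected cost below 1.5N operations, regardless of the number m of detectors.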

2 Previous Work 



Most detection algorithms may be classified as either intensity-based or feature-based.
Intensity-based methods operate directly on the pixel gray level intensities.
In contrast, feature-based methods first extract various geometric cues from
the raw image, then perform higher-level reasoning on this geometric information.

Previous work on multi-template detection includes a large body of work
on recognition of objects distorted under some geometric transformation group,
using invariants. Some intensity-based methods use moment invariants for
recognition of objects under Euclidean or affine transformations. One difficulty
with these methods is that one has to compute the local moments of
many areas in the input image. Also, moment-based methods cannot handle
more complex transformations (e.g. there are no moment invariants for projective
transformations, or among different views of the same three-dimensional
object).

Feature-based algorithms have to contend with the considerable difficulty
of locating features in the image. Methods that use differential invariants,
and thus require computing derivatives, have to overcome the numerical difficulties
involved in reliably computing such derivatives in noisy images.

Of the intensity-based methods for solving the multi-template detection problem,
the eigenface method of Turk and Pentland has drawn a great
deal of attention. This method approximates the multi-template T by a
low-dimensional linear subspace F, usually called the face space. Images are classified
as potential members of T if their distance from F is smaller than a certain
threshold.
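
The distance test can be sketched as follows, assuming that an orthonormal basis of F is already available and that the input has been mean-subtracted beforehand. Both are assumptions of this illustration; the function names and the toy basis are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def distance_from_subspace(x, basis):
    """Distance of x from the span of an orthonormal basis, using k inner products."""
    residual_sq = dot(x, x) - sum(dot(x, b) ** 2 for b in basis)
    return math.sqrt(max(residual_sq, 0.0))  # clamp tiny negative rounding errors

def is_candidate(x, basis, threshold):
    """Accept x as a potential member if it lies close enough to the subspace."""
    return distance_from_subspace(x, basis) < threshold

# Toy example: F is the plane spanned by the first two coordinate axes.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
d = distance_from_subspace([3.0, 4.0, 2.0], basis)
```

The cost grows linearly with the dimension k of the subspace, since one inner product per basis vector is needed.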

The eigenface method can be viewed as an attempt to model T’s distribution.
Other work on modeling this distribution includes the study of the within-class
vs. “general” scatter, and a more elaborate modeling of the probability
distribution in the face class. Elsewhere, eigenfaces were combined with a novel
search technique to detect objects, and also recover their pose and the ambient
illumination; however, it was assumed that the objects (from the COIL database)
were already segmented from the background, and recognition was restricted to
that database.

The eigenface method has been rather successful for various detection problems,
such as detecting frontal human faces. However, our experiments have
suggested that once a large class of transformations comes into play - for instance,
if one tries to detect objects under arbitrary rotation, and possibly other
distortions - the eigenface method runs into problems. This was reaffirmed by
one of the method’s inventors.
In an attempt to apply the eigenface principle to detection under linear
transformations, a version of the eigenface method is applied to detect an object
with strong high-frequency components in a cluttered scene. However, the range
of transformations was limited to rotation only, and only at the angles −50°
to 50°. The dimension of the face space used was 20. We will show results for
a far more complicated family of transformations, using a number of detectors
substantially smaller than 20.

Neural nets have been applied, with considerable success, to the problem of
face detection, and also to the detection of faces under unknown rotation. It is not clear
whether the methods used there can be extended to more general transformation
groups than the rotation group, as the neural net constructed there is
trained to return the rotation angle; for a family of transformations with more
than one degree of freedom, both the training and the detection become far
more complicated, because both the size of the training set, and the net’s set of
responses, grow exponentially with the number of degrees of freedom.

Support vector machines (SVM’s) are conceptually the method
closest in spirit to the method suggested in this paper. An SVM consists of a
function G which is applied to each candidate image t, and it classifies it as
a member of the multi-template T or not, depending on the value of G(t). A
great deal of effort has been put into finding such a function which optimally
characterizes T. A typical choice is

$$G(t) = \operatorname{sgn}\Big( \sum_{i=1}^{l} \lambda_i y_i K(t, x_i) + b \Big)$$

where t is the image to be classified, $x_i$ are the training images, $y_i$ is 1 or −1
depending on whether $x_i$ is in T (or a training set for T) or not, and K(·,·) is a
“classifier function” (for example, $K(t, x_i) = \exp(-\|t - x_i\|^2)$). Usually, a function
is sought for which only a relatively small number of the $x_i$ are used,
and these $x_i$ are called the support vectors. Thus, the speed of SVM’s depends
to a considerable extent on the number of support vectors. The $\lambda_i$ are recovered
by solving an optimization problem designed to yield a best separating
hyperplane between T and its complement (or possibly between two different
multi-templates). SVM’s were introduced by Vapnik, and can be viewed as a
mechanism to find the optimal separating hyperplane, either in the space of the
original variables, or in a higher-dimensional “feature space”. The feature space
consists of various functions of the components of the original t vectors, such
as polynomials in these components, and allows for a more powerful detection
scheme.
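
Evaluating such a decision function is straightforward once the coefficients are known. The sketch below assumes the exponential kernel given as an example above; the support vectors, labels, coefficients and bias are hand-picked for illustration and are not the result of any training, and sgn(0) is mapped to +1 by convention.

```python
import math

def rbf_kernel(t, x):
    """Example kernel K(t, x) = exp(-||t - x||^2)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(t, x)))

def svm_decision(t, support_vectors, labels, lambdas, b, kernel=rbf_kernel):
    """Sign of sum_i lambda_i * y_i * K(t, x_i) + b; returns +1 or -1."""
    s = b
    for lam, y, x in zip(lambdas, labels, support_vectors):
        s += lam * y * kernel(t, x)
    return 1 if s >= 0 else -1

# Hypothetical, hand-picked model with two support vectors.
support_vectors = [[0.0, 0.0], [3.0, 3.0]]
labels = [1, -1]
lambdas = [1.0, 1.0]
bias = 0.0

member = svm_decision([0.2, 0.1], support_vectors, labels, lambdas, bias)
non_member = svm_decision([2.9, 3.1], support_vectors, labels, lambdas, bias)
```

Each classification requires one kernel evaluation per support vector, which is why the number of support vectors governs the speed.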

As opposed to SVM’s and neural nets, the method suggested here does not 
require a training loop on negative examples, because it makes an assumption on 
their statistics - which is borne out in practice - and uses it to reduce false alarms 
(false alarms are cases in which a non-member of T is erroneously classified as 
a member). 

3 The “Anti-Face” Method: Mathematical 
Foundation 

To recap, for a multi-template T, the “anti-face detectors” are defined as vectors 
satisfying the following three conditions: 

— The absolute values of their inner product with T’s images is small. 

— They are as smooth as possible, so as to make the absolute values of their 
inner product with “random” images large; this is the characteristic which 
enables them to separate T’s images from random images. This will be 
formalized in Section 3.1. 

— They act in an independent manner, which implies that their false alarms are 
uncorrelated. As we shall prove, this does not mean that the inner product of 
different detectors is zero, but implies a slightly more complicated condition. 
The independence of the detectors is crucial to the success of the algorithm, 
as it results in a number of false alarms which is exponentially decreasing in 
the number of detectors. This is explained in Section 3.2. 

Once the detectors are found, the detection process is straightforward and 
very easy to implement: an image is classified as a member of T iff the abso- 
lute value of its inner product with each detector is smaller than some (detector 
specific) threshold. This allows a quick implementation using convolutions. Ty- 
pically, the threshold was chosen as twice the maximum over the absolute values 
of the inner products of the given detector with the members of a training set 
for T. This factor of two makes it possible to detect not only the members of the training 
set (which is a sample of T), but also images which are close to them, which 
suffices if the training set is dense enough in T. 
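
The detection rule just described can be sketched as follows. This is a minimal illustration under stated assumptions: images and detectors are flat lists of pixel values, the helper names are hypothetical, and the toy vectors are invented for the example. The threshold follows the rule in the text (twice the largest absolute inner product over the training set).

```python
def inner(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def thresholds(detectors, training_set, factor=2.0):
    """Per-detector threshold: factor * max |<d, t>| over the training set."""
    return [factor * max(abs(inner(d, t)) for t in training_set)
            for d in detectors]

def is_member(image, detectors, thresh):
    """Accept iff every detector responds below its threshold.
    all() stops at the first failing detector, so most rejects are cheap."""
    return all(abs(inner(d, image)) < th for d, th in zip(detectors, thresh))

# Toy 3-pixel "images" (hypothetical values).
training = [[1.0, 1.0, 0.0], [1.0, 0.9, 0.1]]
dets = [[0.7071, -0.7071, 0.0]]
th = thresholds(dets, training)
accepted = is_member([1.0, 0.95, 0.05], dets, th)
rejected = is_member([1.0, -1.0, 0.0], dets, th)
```

In a full-image search the inner products for all window positions would be obtained by convolution, as the text notes; the loop above only shows the per-window decision.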

A schematic description of the detection algorithm is presented in Fig. 3. 



[Figure: “Schematic Description of the Detection”. Labels: “direction of smoothness”, templates, natural images, eigenface method positive set, anti-face method positive set.] 

Fig. 3. Schematic description of the algorithm. 




3.1 Computing the Expectation of the Inner Product 

Let us proceed to prove that the absolute value of the inner product of two 
“random” natural images is large (for the statement to make sense, assume 
that both images are of zero mean and unit norm). The Boltzmann distribution, 
which proved to be a reasonable model for natural images, assigns to an 
image I a probability proportional to the exponent of the negative of some 
“smoothness measure” for I. Usually, an expression such as \iint (I_x^2 + I_y^2) dx dy, 
or \iint (I_{xx}^2 + 2 I_{xy}^2 + I_{yy}^2) dx dy, is used. It is preferable, for the forthcoming 
analysis, to work in the frequency domain, since then the smoothness measure 
operator is diagonal, hence more manageable. The smoothness of a (normalized) 
n x n image I, denoted S(I), is defined by 



S(I) = \sum_{(k,l) \neq (0,0)} (k^2 + l^2) \, \mathcal{I}^2(k,l)        (2)

(note that S(I) is small for smooth images), and its probability is defined, following 
the Boltzmann distribution, as 

Pr(I) \propto \exp(-S(I))        (3)

where \mathcal{I}(k,l) are the DCT (Discrete Cosine Transform) coefficients of I. If the 
images are normalized to zero mean, \mathcal{I}(0,0) = 0. This definition is clearly in 
the spirit of the continuous, integral-based definitions, and assigns higher proba- 
bilities to smoother images. Hereafter, when referring to “random images”, we 
shall mean “random” in this probability space. Now it is possible to formalize 
the observation “the absolute value of the inner product of two random images 
is large”. For a given image F, of size n x n, the expectation of the square of its 
inner product with a random image equals 

E[(F,I)^2] = \int_{R^{n \times n}} (F,I)^2 \, Pr(I) \, dI

Using Parseval’s identity, this can be computed in the DCT domain. Substituting 
the expression for the probability (Eq. 3), and denoting the DCT transforms of 
F and I by \mathcal{F} and \mathcal{I} respectively, we obtain 

\int_{R^{n \times n - 1}} \Big( \sum_{(k,l) \neq (0,0)} \mathcal{F}(k,l) \, \mathcal{I}(k,l) \Big)^2 \exp\Big( -\sum_{(k,l) \neq (0,0)} (k^2 + l^2) \, \mathcal{I}^2(k,l) \Big) \, d\mathcal{I}



which, after some manipulations, turns out to be proportional to 



\sum_{(k,l) \neq (0,0)} \frac{\mathcal{F}^2(k,l)}{(k^2 + l^2)^{3/2}}        (4)



Since the images are normalized to unit length, it is obvious that, for the 
expression in Eq. 4 to be large, the dominant values of the DCT transform \mathcal{F}(k,l) 
should be concentrated in the small values of k, l - in other words, that F be 
smooth. 
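
The following sketch makes the claim concrete on two toy 4 x 4 images. It assumes the smoothness measure of Eq. 2 is the sum of (k^2 + l^2) times the squared DCT coefficient, and that Eq. 4 is the sum of squared coefficients divided by (k^2 + l^2)^(3/2), both over (k, l) other than (0, 0). The naive orthonormal DCT-II, the helper names and the test images are illustrative assumptions, not the implementation used for the experiments.

```python
import math

def normalise(img):
    """Shift to zero mean and scale to unit Euclidean norm."""
    flat = [v for row in img for v in row]
    mean = sum(flat) / len(flat)
    norm = math.sqrt(sum((v - mean) ** 2 for v in flat))
    return [[(v - mean) / norm for v in row] for row in img]

def dct2(img):
    """Naive orthonormal 2-D DCT-II of a square n x n list of lists."""
    n = len(img)
    def c(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
    out = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for l in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (img[x][y]
                          * math.cos(math.pi * (2 * x + 1) * k / (2 * n))
                          * math.cos(math.pi * (2 * y + 1) * l / (2 * n)))
            out[k][l] = c(k) * c(l) * s
    return out

def smoothness(coeffs):
    """S(I): sum of (k^2 + l^2) * coeff^2 over (k, l) != (0, 0)."""
    n = len(coeffs)
    return sum((k * k + l * l) * coeffs[k][l] ** 2
               for k in range(n) for l in range(n) if (k, l) != (0, 0))

def eq4_score(coeffs):
    """Eq. 4 up to a constant: sum of coeff^2 / (k^2 + l^2)^(3/2)."""
    n = len(coeffs)
    return sum(coeffs[k][l] ** 2 / (k * k + l * l) ** 1.5
               for k in range(n) for l in range(n) if (k, l) != (0, 0))

n = 4
ramp = normalise([[float(x) for _ in range(n)] for x in range(n)])
checker = normalise([[(-1.0) ** (x + y) for y in range(n)] for x in range(n)])
ramp_c, checker_c = dct2(ramp), dct2(checker)
```

On these inputs the smooth ramp has the smaller S and the larger Eq. 4 score, while the checkerboard, whose energy sits in high-frequency coefficients, has the opposite ordering.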

This theoretical result is well supported empirically. In Fig. 4, the empirical 
expectation of (F, I)^2 is plotted against Eq. 4. The expectation was computed 
for 5,000 different F, by averaging their squared inner products with 15,000 
sub-images of natural images. The size was 20 x 20 pixels. The figure demonstrates 
a reasonable linear fit between Eq. 4 and the empirical expectation. 

[Figure: “Relationship between theoretical and empirical expectation of squared inner product with detector d”; the plotted quantity is E[(d,I)^2].] 

Fig. 4. Empirical verification of Eq. 4. 



3.2 Forcing the Detectors to be Independent 

It is difficult to expect that one detector can detect T without many false alarms. 
This is because, for a single detector d, although |(d, I)| is large on the average 
for a random image I, there will always be many random images I such that 
|(d, I)| is small, and these images will be erroneously classified as members of T. 
The optimal remedy for this is to apply a few detectors which act independently; 
this implies that if the false alarm rate (= percentage of false alarms) of d_1 is 
p_1, and that of d_2 is p_2, then the false alarm rate for both detectors will be p_1 p_2. 
Since the entire detection scheme rests on the probability distribution defined in 
Eq. 3, the notion of independence is equivalent to the requirement that the two 
random variables, defined by I -> (I, d_1) and I -> (I, d_2), be independent, or 

\int_{R^{n \times n}} (I, d_1)(I, d_2) \, Pr(I) \, dI = 0

where Pr(I) is defined as in Eq. 3. Denote this integral by (d_1, d_2)^*; it turns out 
to be 

(d_1, d_2)^* = \sum_{(k,l) \neq (0,0)} \frac{\mathcal{D}_1(k,l) \, \mathcal{D}_2(k,l)}{(k^2 + l^2)^{3/2}}        (5)

where \mathcal{D}_1 and \mathcal{D}_2 are the DCT transforms of d_1 and d_2. 

3.3 Finding the Detectors 

To find the first anti-face detector, d_1, the following optimization problem is 
solved: 

1. d_1 has to be of unit norm. 

2. |(d_1, t)| should be small, for every image t in the training set for the 
multi-template T. 

3. d_1 should be as smooth as possible under the first and second constraints, 
which will ensure that the expression in Eq. 4 will be large. 

The solution we implemented proceeds as follows. First, choose an appropriate 
value for \max_{t \in T} |(d_1, t)|; experience has taught us that it doesn’t matter 
much which value is used, as long as it is substantially smaller than the absolute 
value of the inner product of two random images. Usually, for images of size 
20 X 20, we have chosen this maximum value - denoted by M - as 10“®. If it is 
not possible to attain this value - which will happen if T is very rich - choose a 
larger M . Next, minimize 



\max_{t \in T} |(d_1, t)| + \lambda S(d_1)

and, using a binary search on \lambda, set it so that \max_{t \in T} |(d_1, t)| = M. 

After d_1 is found, it is straightforward to recover d_2; the only difference is 
the additional condition (d_1, d_2)^* = 0 (see Eq. 5), and it is easy to incorporate 
this condition into the optimization scheme. The other detectors are found in a 
similar manner. 

We have also implemented a simpler algorithm, which minimizes the quadratic 
target function \sum_{t \in T} (d_1, t)^2 + \lambda S(d_1). The resulting detectors are suboptimal, 
but usually 30% more such detectors will yield the same performance as the 
optimal ones. 

4 Experimental Results 

We have tested the anti-face method on both synthetic and real examples. In Sec- 
tion 4.1, it is compared against the eigenface method for the problem of detecting 
a frontal face subject to increasingly complicated families of transformations. In 
these experiments, the test images were synthetically created. In the other ex- 
periments, the anti-face method was applied to detect various objects in real 
images: a pocket calculator which is nearly planar, and the well-known COIL 
database of 3D objects, photographed in various poses. Results for the COIL 
objects were compared with those of support vector machine based algorithms. 

The results for the COIL database are not presented here, due to lack of 
space. Readers interested in a more complete version of this work, which includes 
these results, are welcome to mail the first author. 

— A note on complexity: recall that each detector only tests the input images 
which have passed all the thresholds imposed by the preceding detectors. 
If, say, eight anti-face detectors were used, that does not mean that the 
number of operations for an input image with N pixels was 8N. If, for 
example, the average false alarm rate for the detectors is 30%, then 70% of 
the input images will be discarded after the first detector, hence require N 
operations; of the 30% which pass the first detector, 70% (21% of all inputs) 
will not pass the second detector, hence they’ll require 2N operations, etc. Thus, 
the average number of operations per input image will be roughly 
(0.7 · 1 + 0.21 · 2 + 0.063 · 3 + ...)N. In all our experiments, no more than 1.5N 
operations were required for classifying an N-pixel input image. Note that this analysis assumes that 
the large majority of input images are false alarms, a reasonable assumption 
if one searches all the sub-images of a large image. 
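
The arithmetic in this note can be checked with a short sketch. It assumes, as in the example above, that every detector passes the same fraction of the images that reach it and that the detectors act independently; the function name is hypothetical.

```python
def expected_detector_applications(pass_rate, num_detectors):
    """Average number of detectors applied per input image.

    Detector j (0-based) is only applied to the fraction pass_rate**j of
    inputs that survived all earlier detectors, so the expected count is
    the sum of those fractions.
    """
    return sum(pass_rate ** j for j in range(num_detectors))

# 30% false alarm rate per detector, eight detectors, as in the example.
ops_per_pixel = expected_detector_applications(0.3, 8)
```

Under these assumptions the result is (1 - 0.3^8) / 0.7, about 1.43 operations per pixel, which is in line with the bound of 1.5N operations reported in the text.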



4.1 Performance as Function of Multi-template’s Complexity 

In order to test the performance of the anti-face method with multi-templates of 
increasing complexity, we have created the following three multi-templates, each 
of which consists of a family of transformations applied to the frontal image of 
a face (20 x 20 pixels). The background consisted of other faces. 

— Rotation only. 

— Rotation and uniform scale at the range 0.7 to 1.3. 

— The subgroup of linear transformations spanned by rotations and indepen- 
dent scaling at the x and y axis, at the range 0.8 to 1.2. 

In order to estimate the complexity of these multi-templates, we created the 
scatter matrix for a training set of each, and computed the number of largest 
eigenvalues whose sum equals 90% of the sum of all 400 eigenvalues. This is a 
rough measure of the “linear complexity” of the multi-template. 
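
A minimal sketch of this complexity measure is given below. It assumes the eigenvalues of the scatter matrix are already available as a list; the function name and the toy spectrum are hypothetical and unrelated to the multi-templates used in the experiments.

```python
def num_eigenvalues_for_energy(eigenvalues, fraction=0.9):
    """Smallest number of largest eigenvalues whose sum reaches
    `fraction` of the sum of all eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    target = fraction * sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running >= target:
            return count
    return len(ordered)

# Hypothetical spectrum; the cumulative sums are 50, 75, 85, 91, ...
spectrum = [50.0, 25.0, 10.0, 6.0, 4.0, 3.0, 1.0, 1.0]
k = num_eigenvalues_for_energy(spectrum)
```

For 20 x 20 images the list would hold the 400 eigenvalues mentioned in the text.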

Ten images from each multi-template were then super-imposed on an image 
consisting of 400 human faces, each 20 x 20 pixels, and both the eigenface and 
anti-face algorithms were applied. These ten images were not in the training set. 

Interestingly, while the eigenface method’s performance decreased rapidly as 
the multi-template’s complexity increased, there was hardly a decrease in the 
performance of the anti-face method. The following table summarizes the results: 

Algorithm’s Performance             | Rotation | Rotation + Scale | Linear 
------------------------------------+----------+------------------+------- 
Number of eigenvalues required      |    13    |        38        |   68 
for 90% energy                      |          |                  | 
Eigenfaces: dimension of face space |    12    |        74        |  145 
required for accurate detection     |          |                  | 
Anti-faces: number of detectors     |     3    |         4        |    4 
required for accurate detection     |          |                  | 



Independence of the Detectors. For the case of linear transformations (most 
complicated multi-template), the false alarm rates for the first, second, and third 
detectors respectively were p_1 = 0.0518, p_2 = 0.0568, and p_3 = 0.0572; the false 
alarm rate for the three combined was 0.00017, which is nearly equal to p_1 p_2 p_3 
(about 0.000168). This indicates that the detectors indeed act independently. With 
four detectors, there were no false alarms. 
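
The comparison can be reproduced directly from the three reported rates. The sketch below only multiplies the published numbers; the variable names are illustrative.

```python
p1, p2, p3 = 0.0518, 0.0568, 0.0572

# Rate expected if the three detectors act independently.
predicted = p1 * p2 * p3

# Combined false alarm rate reported in the text.
observed = 0.00017

# Relative difference between the reported and the predicted rate.
relative_gap = abs(observed - predicted) / predicted
```

The predicted rate is about 1.68e-4, so the reported combined rate differs from it by roughly one percent.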

Detectors and Results. Some of the images in the multi-template are now 
shown, as well as the first four detectors and the detection result of the anti- 
face method, and also the result of the eigenface method with a face space of 
dimension 100. 




Fig. 5. Sample 20 x 20 pixel templates, and the first three anti-face detectors. 




Fig. 6. Detection of “Esti” face, anti-face method (left), and eigenface method with a 
face space of dimension 100 (right). 



4.2 Detection of Pocket Calculator 

In this set of experiments, the problem of detecting a pocket calculator photo- 
graphed from different angles and distances was tackled. Here, too, the anti-face 
method performed well, and eight detectors sufficed to recover the object in all 
the experiments without false alarms, which was substantially faster than the 
eigenface method. 

Fig. 7. Detection of pocket calculator, anti-face method (left), and eigenface method 
with a face space of dimension eight (right). 



5 Conclusions and Further Research 

A novel detection algorithm - “anti-faces” - was presented, and successfully 
applied to detect various image classes, of the type which often occur in real-life 
problems. The algorithm uses a simple observation on the statistics of natural 
images, and a compact implicit representation of the image class, to very quickly 
reduce false alarm rate in detection. In terms of speed, it proved to be superior 
to both eigenface and support vector machine based algorithms. 

We hope to extend the anti-face paradigm to other problems, such as detec- 
tion of 3D objects under a larger family of views, and event detection. 

Abstract. We propose a model of appearance and a matching method 
which combines ‘global’ models (in which a few parameters control global 
appearance) with local elastic or optical-flow-based methods, in which 
deformation is described by many local parameters together with some 
regularisation constraints. We use an Active Appearance Model (AAM) 
as the global model, which can match a statistical model of appearance 
to a new image rapidly. However, the amount of variation allowed is 
constrained by the modes of the model, which may be too restrictive 
(for instance when insufficient training examples are available, or the 
number of modes is deliberately truncated for efficiency or memory con- 
servation). To compensate for this, after global AAM convergence, we 
allow further local model deformation, driven by local AAMs around 
each model node. This is analogous to optical flow or ‘demon’ methods 
of non-linear image registration. We describe the technique in detail, and 
demonstrate that allowing this extra freedom can improve the accuracy 
of object location with only a modest increase in search time. We show 
the combined method is more accurate than either pure local or pure 
global model search. 



1 Introduction 

The ability to match a model to an image rapidly and accurately is very im- 
portant for many problems in computer vision. Here we are concerned with 
parametrised deformable models, which can synthesise new images of the 
object(s) represented. For instance, a model of facial appearance should be able 
to synthesise new face images under a range of simulated viewing conditions. 
Matching such a model to a new image involves finding the parameters which 
synthesise something as close as possible to the target image. The model para- 
meters then describe the target object, and can be used for further processing 
(tracking, measurement, recognition etc). 

There have been many attempts to construct such models of appearance. 
These can be broken down into two broad categories: those that have a relatively 
small number of parameters which control global appearance properties, 
and those that have a relatively large number of parameters controlling local 
appearance properties. The former class includes the Morphable Models of Jones 
and Poggio, the face models of Vetter, the shape and appearance models 
of Cootes et al., the intensity surface models of Nastar et al. and Turk and 
Pentland’s Eigenfaces, amongst others. Here changing any one model parameter 
can change the whole shape or appearance. The second class, that of ‘local’ 
models, includes the various non-linear image registration schemes in which the 
deformation is described by a dense flow field or the positions of a set of grid 
nodes (these positions are essentially the parameters of the model). Examples 
of such algorithms have been reviewed elsewhere, and include the work of Christensen, 
Collins et al., Thirion and Lester et al., amongst others. 

Global models are not always general enough to represent all new examples 
accurately. The local models tend not to be specific enough. With no global con- 
straints they can match to illegal examples. Here we propose a model combining 
both global and local constraints. The global deformation model is a statistical 
model of appearance as described by Lanitis et al. This comprises a shape 
and a texture model trained on a set of landmarked images. A small number of 
parameters can control both the shape and the texture, allowing approximation 
of any of the training set and generalisation to unseen examples. Such a model 
can be matched to a new image rapidly using the ‘Active Appearance Model’ 
algorithm. However, without using a large number of model parameters, such 
a model may not be able to locate accurately a boundary which displays a lot of 
local variation. To use a large number of parameters can be inefficient, as essen- 
tially local shape or texture deformations become represented as global modes 
of variation. 

To overcome this we allow smooth local deformations of the shape around the 
global model shape. Smoothness is imposed during search by explicitly smoo- 
thing any suggested local deformations. Local deformations can be calculated 
rapidly using a modification of the AAM algorithm in a generalisation of optical 
flow or Thirion’s ‘demons’ approach. Rather than use local image gradients 
to drive local neighbourhood matching, we explicitly learn the relationship 
between local model displacements and the difference image induced. We demon- 
strate that allowing local deformation can lead to more accurate matching, and 
that it can be more effective than simply adding more global modes. We describe 
the algorithm in detail, and show results of experiments testing its performance. 



2 Background 

The algorithm described below is a combination of a deformable statistical model 
matching algorithm with optical-flow-like image registration techniques. For 
a comprehensive review of work in these fields, there are surveys of image 
registration methods and of deformable models in medical image analysis. We 
give here a brief review of more recent relevant work. 

The closest work to ours is the ‘Multidimensional Morphable Models’ of Jones 
and Poggio. These are linear models of appearance which can be matched 
to a new image using a stochastic optimisation method. The model is built from 
a set of training images using a boot-strapping algorithm. The current model 
is matched to a new image, and optical flow algorithms are used to refine the fit. 
This gives the correspondences on the new image, allowing it to be added to 
the model. This boot-strapping step, that of using optical flow style algorithms 
to improve the match of a model, is similar to the approach we describe in this 
paper. However, our use of the AAM search algorithm allows much more rapid 
matching, and we propose that the local search can be an integral part of a 
multi-resolution search algorithm, rather than just used during model training. 

Thirion expresses image registration in a framework of diffusing models, 
and demonstrates how this can lead to various image matching algorithms. The 
‘local AAMs’ described below are essentially one incarnation of his ‘demons’. 

Lester et al. describe a non-linear registration algorithm modelling image 
deformation as fluid flow. They only allow mapping from one image to another, 
and require a fairly complex prior description of how different parts of the image 
are allowed to move. 

Christensen demonstrates pair-wise image registration using symmetric 
transformations which are guaranteed to be consistent (i.e. the mapping from 
image A to image B is the inverse of that from B to A). He uses a Fourier basis 
for the allowed deformations, and shows how a consistent mean image can be 
built by repeatedly mapping different images to the current mean. However, the 
algorithm as described is an image to image, rather than a model to image map. 

Collins et al. register two images by building a non-linear deformation field, 
recursively matching local spherical neighbourhoods. Deformations are only 
constrained to be locally smooth, imposed by explicit smoothing of the displacement 
field. 

An alternative method of introducing extra flexibility into a linear shape 
model is to add variation to mimic elastic vibration modes, or to add extra 
artificial covariance to the model to induce such modes. However, this can 
generate large numbers of global modes which can become too computationally 
expensive to deal with efficiently. By allowing smooth local deformations, we 
reduce the computational load considerably. 



3 Active Appearance Models 

3.1 Statistical Appearance Models 

A statistical appearance model can represent both the shape and texture 
variability seen in a training set. The training set consists of labelled images, 
where key landmark points are marked on each example object. For instance, 
to build a model of the central brain structures in 2D MR images of the brain 
we need a number of images marked with points at key positions to outline the 
main features (Figure 1). Similarly a face model requires labelled face images 
(Figure 2). 

Fig. 1. Example of MR brain slice labelled with 123 landmark points around the 
ventricles, the caudate nucleus and the lentiform nucleus 

Fig. 2. Example of face image labelled with 68 landmark points 

Given such a set we can generate a statistical model of shape and texture 
variation. Shape can be represented as a vector x of point 
positions and the texture (or grey-levels) represented as a vector g. The 
appearance model has a vector of parameters c controlling the shape and texture 
according to 

x = \bar{x} + Q_s c 
g = \bar{g} + Q_g c        (1)

where \bar{x} is the mean shape, \bar{g} the mean texture and Q_s, Q_g are matrices 
describing the modes of variation derived from the training set. 

The model points, x, are in a shape model frame and are mapped into the 
image frame by applying a euclidean transformation with parameters t, T_t(x). 
Similarly the texture vector, g, is in a normalised texture frame. The actual 
image texture is obtained by applying a scaling and offset transformation with 
parameters u, T_u(g). 

An example image can be synthesised for a given c by generating a texture 
image from the vector g and warping it using the control points described by 
x (we currently use a piecewise affine transformation based on a triangulation 
of the control points). For instance, Figure 3 shows the effects of varying the 
first two appearance model parameters, c_1, c_2, of a model trained on a set of 
face images, labelled as shown in Figure 2. These change both the shape and the 
texture components of the synthesised image. 
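
The linear part of the model, in which shape and texture are each a mean plus a mode matrix times the parameter vector c, can be sketched as below. The function names and the tiny two-point, three-pixel model are invented for illustration; the pose transform, the texture normalisation and the piecewise affine warp are deliberately left out.

```python
def mat_vec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def synthesise(mean_shape, q_s, mean_texture, q_g, c):
    """Return (x, g) with x = mean_shape + Q_s c and g = mean_texture + Q_g c."""
    x = [m + d for m, d in zip(mean_shape, mat_vec(q_s, c))]
    g = [m + d for m, d in zip(mean_texture, mat_vec(q_g, c))]
    return x, g

# Hypothetical model: two points stored as (x1, x2, y1, y2), three texture
# samples, and a single mode of variation.
mean_shape = [0.0, 1.0, 0.0, 1.0]
q_s = [[1.0], [-1.0], [0.0], [0.0]]
mean_texture = [0.5, 0.5, 0.5]
q_g = [[0.1], [0.0], [-0.1]]

x_new, g_new = synthesise(mean_shape, q_s, mean_texture, q_g, c=[2.0])
```

Because one parameter drives both matrices, changing c moves the points and alters the grey levels together, which is the coupled behaviour described for the first two modes.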

4 Active Appearance Model Matching 

We treat the process of matching an appearance model to an image as an 
optimisation problem. Given a set of model parameters, c, we can generate a hypothesis 
for the shape, x, and texture, g_m, of a model instance. To compare this hypothesis 
with the image, we use the suggested shape to sample the image texture, g_s, 
and compute the difference, \delta g = g_s - g_m. We seek to minimise the magnitude 
of |\delta g|. 

[Figure panels: c_1 varies by ±2 s.d.s; c_2 varies by ±2 s.d.s] 

Fig. 3. First two modes of an appearance model of a face 

This is potentially a very difficult optimisation problem, but we exploit the 
fact that whenever we use a given model with images containing the modelled 
structure the optimisation problem will be similar. This means that we can learn 
how to solve the problem off-line. In particular, we observe that the pattern in 
the difference vector \delta g will be related to the error in the model parameters. 

During a training phase, the AAM learns a linear relationship between \delta g 
and the parameter perturbation required to drive it to zero, \delta c = A \delta g. The 
matrix A is obtained by linear regression on random displacements from the 
known correct model parameters and the induced image residuals. 

During search we simply iteratively compute \delta g given the current parameters 
c and then update the samples using c -> c - \delta c. This is repeated until 
no improvement is made to the error, |\delta g|^2, and convergence is declared. We 
use a multi-resolution implementation of search. This is more efficient and can 
converge to the correct solution from further away than search at a single 
resolution. 
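
The search loop can be sketched on a toy problem. The sketch assumes a callable that returns the residual for given parameters and a fixed prediction matrix A; both, and the one-parameter example with a deliberately imperfect A, are hypothetical stand-ins for image sampling and for the regression learnt in training. Multi-resolution and pose are omitted.

```python
def aam_search(c, residual, A, max_iters=50):
    """Iterate c -> c - A * delta_g until the squared error stops improving."""
    def sq_err(dg):
        return sum(v * v for v in dg)

    dg = residual(c)
    err = sq_err(dg)
    for _ in range(max_iters):
        # Predicted parameter error, delta_c = A * delta_g.
        dc = [sum(a * v for a, v in zip(row, dg)) for row in A]
        c_new = [ci - di for ci, di in zip(c, dc)]
        dg_new = residual(c_new)
        err_new = sq_err(dg_new)
        if err_new >= err:  # no improvement: declare convergence
            break
        c, dg, err = c_new, dg_new, err_new
    return c, err

# Toy residual: linear in the parameter error, zero at c = 1.5.
def toy_residual(c):
    e = c[0] - 1.5
    return [2.0 * e, -1.0 * e]

# 0.8 times the exact pseudo-inverse, so several iterations are needed.
A = [[0.32, -0.16]]
c_fit, final_err = aam_search([0.0], toy_residual, A)
```

Each pass removes 80% of the remaining parameter error in this toy case, so the loop converges geometrically and stops once the error no longer decreases.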

For example, Figure 4 shows an AAM of the central structures 
of the brain slice converging from a displaced position on a previously unseen 
image. The model represented about 10000 pixels and c was a vector of 30 
parameters. The search took about 330ms on a 450MHz PC. Figure 5 shows 
examples of the results of the search, with the found model points superimposed 
on the target images. 

5 Allowing Local Deformation 

An AAM may not be able to match accurately to a new image if the image 
displays variation not present in the training set, or if the number of model 
modes has been truncated too severely. The latter may arise from the need to 
reduce the storage requirements or search time. In order to match accurately to 
a new image in such a case, we must allow more deformation than that defined 
by the model. 

[Figure panels: Initial, 2 its, 6 its, 16 its, original] 

Fig. 4. Multi-resolution AAM search from a displaced position 




Fig. 5. Results of AAM search. Model points superimposed on target image 



To allow local deformations, we modify the appearance model above by adding 
a vector of deformations to the shape in the model frame, 

x = \bar{x} + Q_s c + \delta x 
g = \bar{g} + Q_g c        (2)

The allowed local deformations, \delta x, are assumed limited to some range about 
zero which can be estimated from the training set, and are assumed to represent 
a smooth deformation from the global model. This assumption of smoothness is 
necessary to regularise the solution, avoiding overfitting the data. 

Valid values of \delta x can be generated as follows. Let n be a random vector 
drawn from a unit spherical normal distribution. Let M(\sigma_s) be a (sparse) matrix 
which smooths \delta x by taking a weighted average over a neighbourhood (see below 
for its construction). 

Then a plausible set of smooth displacements is given by \delta x = a M(\sigma_s) n, 
where a describes the magnitude of the allowed displacements. 

For 2D shapes, let the mean shape vector be of the form 

\bar{x} = (\bar{x}_1, ..., \bar{x}_n, \bar{y}_1, ..., \bar{y}_n)^T

Let d_{ij}^2 = (\bar{x}_i - \bar{x}_j)^2 + (\bar{y}_i - \bar{y}_j)^2. 

We construct the smoothing matrix as 

M'_{ij}(\sigma_s) = \exp(-d_{ij}^2 / 2\sigma_s^2)        (3)

M(\sigma_s) = \begin{pmatrix} M'(\sigma_s) & 0 \\ 0 & M'(\sigma_s) \end{pmatrix}        (4)

Since M(\sigma_s) is symmetric, the covariance of \delta x is thus S_s = a^2 M(\sigma_s)^2. We 
could choose to represent \delta x using a subset of the eigenvectors of S_s, P_s, thus 
\delta x = P_s b_s where b_s is a set of weights. However, this is a global representation, 
imposing global constraints. The advantage of the former description is that we 
can easily update our current estimate of \delta x to take into account small local 
changes by smoothing those changes using M(\sigma_s). This can have computational 
advantages. 
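
A minimal sketch of the smoothing matrix and of drawing a smooth displacement follows. It assumes the entries of the per-coordinate block are the unnormalised Gaussian weights exp(-d_ij^2 / (2 sigma^2)), which is how the entries read here; a row-normalised variant would give a true weighted average. The helper names, the four collinear points and the magnitude value are hypothetical.

```python
import math
import random

def smoothing_block(xs, ys, sigma):
    """M'(sigma): entry (i, j) is exp(-d_ij^2 / (2 sigma^2)), where d_ij is
    the distance between mean-shape points i and j."""
    n = len(xs)
    return [[math.exp(-((xs[i] - xs[j]) ** 2 + (ys[i] - ys[j]) ** 2)
                      / (2.0 * sigma ** 2))
             for j in range(n)] for i in range(n)]

def smooth(block, dx):
    """Apply the block-diagonal M(sigma) to dx = (dx_1..dx_n, dy_1..dy_n)."""
    n = len(block)
    def apply(v):
        return [sum(block[i][j] * v[j] for j in range(n)) for i in range(n)]
    return apply(dx[:n]) + apply(dx[n:])

def sample_displacement(block, a, rng):
    """Draw delta_x = a * M(sigma) * noise, with unit normal noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(2 * len(block))]
    return [a * v for v in smooth(block, noise)]

# Four points on a line, unit spacing (hypothetical mean shape).
block = smoothing_block([0.0, 1.0, 2.0, 3.0], [0.0, 0.0, 0.0, 0.0], sigma=1.0)
impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
spread = smooth(block, impulse)
sample = sample_displacement(block, 0.5, random.Random(0))
```

Smoothing a single-point change spreads it to neighbouring points with weights that fall off with distance, and it leaves the other coordinate untouched, which illustrates why small local changes can be folded into the current estimate cheaply.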

5.1 Local AAMs 

In order to determine local deformations we use the AAM search algorithm to 
match image neighbourhoods. For each model point we build a local AAM which 
will be used to determine the displacement of the point from the position defined 
by the global model. To construct a neighbourhood we determine all the model 
pixels within a radius r of the current point in the model frame, and assign them 
a weight using a gaussian kernel of s.d. \sigma_w, centred on the point. For each model 
point, i, let n_i be the number of such pixels, and n_p be the total number of 
pixels represented by the texture model. Let W_i be an n_i x n_p sparse matrix 
(with one non-zero element per row whose value is determined by the gaussian 
kernel) which can be used to select the weighted subset of pixel values, g_i, from 
the texture vector g: 

g_i = W_i g        (5)

During search, after converging with the conventional AAM (thus finding 
the optimal pose and model parameters), we compute the residual texture error 
in the model frame, \delta g = g_s - g_m. For each model point, i, we compute the 
suggested local displacement, 

(\delta x_i, \delta y_i)^T = -A_i W_i \delta g        (6)

where A_i is a 2 x n_i matrix (for a 2D model) describing the relationship between 
point displacements and changes in W_i \delta g, which can be learnt from the training 
set (see below). This gives a suggested local displacement field, which is smoothed 
with M(\sigma_s) before being added to the current estimate of \delta x. 

Training Local AAMs 

We assume that we have trained a global appearance model. 
For each model point we determine the set of nearby model pixels that will 
contribute to its movement, and their corresponding weights, W_i. 

We then fit the appearance model to each training image, deliberately displace 
the points, and learn the relationship between the displacements and the 
induced texture error. 

Specifically, for each training image: 

1. Fit the global appearance model to the known points 

2. For each of a number of small random point displacements, \delta x: 

a) Displace the model points by \delta x 

b) Sample the texture vector at the displaced position, g_s 

c) Compute the texture error, \delta g = g_s - g_m 

d) For each model point, i, compute \delta g_i = W_i \delta g 

At the end of this process, for each point we have a set of m displacements 
{(\delta x_i, \delta y_i)_k} (k = 1..m) and a corresponding set of weighted samples, {\delta g_{i,k}}. 
In the manner of the original AAM, we use multivariate regression to learn the 
best linear relationship 

(\delta x_i, \delta y_i)^T = A_i \delta g_i        (7)



However, where there is little structure in the texture error vector, it will 
be difficult to accurately predict the suggested movement. This is equivalent to 
the difficulty of computing accurate optical flow in smooth regions. We use the 
same solution as Thirion, and weight the contribution by a measure of the 
amount of local structure. In particular we estimate the relative importance of 
each point, V_i, using the total sum of variance of the samples for each point. 

We then weight each prediction matrix using its relative importance, V_i / V_max, 
where V_max is the largest V_i. 

The models are trained in a multi-resolution framework (for each point we 
have a local AAM at each level) . 

6 Matching Algorithm 

The full multi-resolution matching algorithm is as follows. We assume that we 
start with an initial estimate of the pose and the model parameters, c, at the 
coarsest level. We set i5x = 0. 

For each level, L = Lmax • • ■ Lmin ■ 

1. Use the standard AAM algorithm to update the pose, t, texture transform, 
u, and appearance parameters, c, to fit to the image, (for details, see 0). 

2. Use local AAMs to update $\delta x$:

For each of $n_{local}$ iterations:

a) Sample image using current shape, $T_t(x)$, giving $g_{im}$

b) Project sample into texture model frame, $g_s = T^{-1}(g_{im})$

c) Compute current model texture, $g_m$

d) Compute current residual, $\delta g = g_s - g_m$

e) For each point compute the suggested update, $\delta x_i' = -A_i W_i \delta g$

f) Smooth the suggested updates, $\delta x_s = M_s \delta x'$

g) Update current displacements: $\delta x \rightarrow \delta x + \delta x_s$

h) Apply limits to displacements

3. If $L \neq L_{min}$ then project pose, $c$ and $\delta x$ into level $L - 1$.
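
The inner loop of step 2 can be sketched as below. This is only an outline under stated assumptions: image sampling, projection and prediction (steps a to e) are collapsed into a caller-supplied `predict` callback, smoothing is a second callback, and the displacement limit is a simple per-coordinate clamp. The function and parameter names are hypothetical.

```python
def local_update(dx, predict, smooth, limit, n_local):
    """Run n_local local-AAM iterations on a list of (x, y) displacements.

    predict(dx) -> suggested per-point updates (steps a-e, supplied by caller)
    smooth(updates) -> smoothed updates (step f)
    limit: assumed bound on the absolute value of each coordinate (step h)
    """
    for _ in range(n_local):
        suggested = predict(dx)                      # steps a) to e)
        smoothed = smooth(suggested)                 # step f)
        dx = [(x + sx, y + sy)                       # step g)
              for (x, y), (sx, sy) in zip(dx, smoothed)]
        dx = [(max(-limit, min(limit, x)),           # step h)
               max(-limit, min(limit, y)))
              for x, y in dx]
    return dx

# Toy usage: a stand-in predictor that moves halfway to a known target.
target = (3.0, -1.0)

def toy_predict(dx):
    return [(0.5 * (target[0] - x), 0.5 * (target[1] - y)) for x, y in dx]

def identity(updates):
    return updates

result = local_update([(0.0, 0.0)], toy_predict, identity, limit=2.0, n_local=5)
```

In this toy run the x displacement saturates at the limit while the y displacement converges towards its target.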

7 Experimental Evaluation 

We have performed a set of systematic experiments to evaluate the new method 
and compare its performance to that of the original global AAM search. 



7.1 Search Using Local AAMs 

In order to test the ability of local AAMs to match a model to an image, we
trained a model on a single image (Figure 6a). The global AAM is thus rigid, with
no modes of shape or texture variation. The AAM algorithm can still be used to
manipulate the pose, t, finding the best Euclidean transformation mapping the
model to a new image (Figure 6b shows the result of a 3 level multi-resolution
search). This does not, however, produce a very accurate fit. We then repeated
the search, allowing 2, 5 and 20 local AAM iterations to deform the shape at
each level of a three level multi-resolution search (Figure 6c-e).

This demonstrates that the local AAM algorithm can successfully deform a 
model to match to an unseen image, giving results similar to those one would 
expect from optical flow or ‘demon’ based methods. 




[Figure 6 panels: (a) training example; (b) search result, no local iterations; (c) search result, 2 local iterations; (d) search result, 5 local iterations; (e) search result, 20 local iterations]

Fig. 6. Training a model on a single brain example then searching with differing
numbers of local AAM updates. With local deformation the otherwise restricted
model can fit well to the new image.



Figure 7 shows a similar experiment, training on one face image then testing
on a different one. Because of the shape and texture differences between the 
original and target images, it is difficult to get an entirely accurate map with a 
model trained on only one example. 

[Figure 7 panels: training example; search result, no local iterations; search result, 2 local iterations]

Fig. 7. Training a model on a single face example then searching with differing numbers
of local AAM updates. Without allowing local deformation, the model cannot fit well
to the new image.



7.2 Accuracy Improvements Using Local AAMs 

We wished to quantify the effect on search accuracy of adding local AAMs to 
the global AAM algorithm. We used two data sets. One contained 68 MR brain 
scan slices, marked with various sub-cortical structures (Figure Q. The other 
contained 400 images of 20 different faces, each with different expressions (Figure 

0 ). 

For each data set we built models from random subsets and used them to 
locate the structures of interest on the remaining images. The number of mo- 
del parameters was chosen so as to explain 95% of the variance observed in the 
training set. In each case a 3-level multi-resolution search was performed, star- 
ting with the mean model parameters, with the model displaced from the true 
position by ±5 pixels. On completing the search we measured the RMS texture 
error between the model and the target image, and the average point position 
error, measured as the distance from the model point to the boundary on which 
the point should lie. In the latter case the target boundaries were defined by the 
original hand-landmarking of the images. 
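
The two error measures used here can be sketched as follows. The sketch assumes that each target boundary is available as a polyline through the hand-placed landmarks and that textures are plain lists of intensities. The function names are hypothetical.

```python
import math

def rms_texture_error(model_texture, image_texture):
    """Root-mean-square difference between two equal-length texture vectors."""
    squared = sum((m - g) ** 2 for m, g in zip(model_texture, image_texture))
    return math.sqrt(squared / len(model_texture))

def point_to_segment(p, a, b):
    """Distance from point p to the line segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        t = 0.0
    else:
        # Parameter of the closest point, clamped to the segment.
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def point_to_boundary(p, polyline):
    """Distance from p to the nearest point on a polyline boundary."""
    return min(point_to_segment(p, polyline[i], polyline[i + 1])
               for i in range(len(polyline) - 1))

boundary = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0)]
d = point_to_boundary((1.0, 1.0), boundary)
```

Measuring to the boundary rather than to the corresponding landmark does not penalise a point for sliding along the curve on which it should lie.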

The experiments were performed with a global AAM (allowing no local de- 
formation) and with the modified AAM, allowing 5 local deformation iterations 
at each resolution. 

Figure 8 shows frequency histograms for the position and texture errors for
models trained from a single brain example. The results suggest that allowing
local deformation improves both position and texture matching. Figure 9 shows
equivalent histograms for models trained on ten examples, again showing
improvements.

[Figure 8 axes: Point-Curve Error (pixels); Texture Error]

Fig. 8. Frequency histograms showing position and texture error distributions for
models trained from a single example of a brain

Fig. 9. Frequency histograms showing position and texture error distributions for
models trained from ten brain images

Figure 10 shows graphs of mean error against the number of examples used
in the training set for the brain data. In some cases the search diverged - these
examples were detected by setting a threshold of 10 pixels on the mean
point-to-point error - and they were not included in the calculation of the mean.
The graphs demonstrate that the use of local AAMs gives a significant improvement
in both point location accuracy and the quality of the synthesised texture.
(Standard errors on the measurements of the means were too small to be usefully
included on the graphs).
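
The exclusion of diverged searches from the mean can be sketched as follows. This sketch assumes each search is summarised by its mean point-to-point error (used only for detection) and by the error value being averaged, and that a search counts as diverged when the former exceeds the threshold. The function name is hypothetical.

```python
def mean_error_excluding_diverged(searches, threshold=10.0):
    """Average the reported error over searches that did not diverge.

    searches: list of (point_to_point_error, reported_error) pairs.
    Returns (mean_error, number_of_diverged_searches).
    """
    kept = [reported for pt_pt, reported in searches if pt_pt <= threshold]
    n_diverged = len(searches) - len(kept)
    mean = sum(kept) / len(kept) if kept else float("nan")
    return mean, n_diverged

# Hypothetical values: the third search has diverged.
mean_error, n_diverged = mean_error_excluding_diverged(
    [(1.0, 0.8), (2.0, 1.2), (25.0, 9.0)])
```

Reporting the number of excluded searches alongside the mean keeps the comparison between methods honest.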

Notice that for a model trained on a single example, allowing local deforma- 
tion significantly improves the match. This is equivalent to various non-linear 
image registration schemes. However, we can achieve better results using a global 
AAM (with no local deformation) trained on a small number of examples. The
best results are achieved when local and global models are combined together. 

Figure 11 shows the equivalent results for the face data. The local AAM
again improves the results, but less markedly. 

Figure 12 shows typical examples of the mean times per search for differing
numbers in the training sets. The experiments were performed on a 550MHz 
PC. Naturally, allowing local deformation increases the search times, though the 
matching algorithm is still fast. The extra time is linearly dependent on the 
number of local iterations used (in this case 5) and the number of pixels in the 
support region for each model point (in this case a disc of radius 5 pixels). 
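
A rough count of the extra work makes the linear dependence concrete. The sketch below assumes the support region contains every pixel whose centre lies within the given radius of the model point, and counts one texture sample per pixel, per point, per local iteration. The figure of 100 model points is purely hypothetical.

```python
def disc_pixel_count(radius):
    """Number of integer pixel positions within `radius` of the centre."""
    r = int(radius)
    return sum(1
               for y in range(-r, r + 1)
               for x in range(-r, r + 1)
               if x * x + y * y <= radius * radius)

def extra_samples_per_level(n_points, n_local, radius):
    """Samples processed by the local AAMs at one resolution level."""
    return n_points * n_local * disc_pixel_count(radius)

pixels = disc_pixel_count(5)                      # 81 pixels in a radius-5 disc
samples = extra_samples_per_level(100, 5, 5)      # 100 points is hypothetical
```

Doubling the number of local iterations doubles this count, while doubling the radius roughly quadruples it.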



7.3 Varying the Model Modes 

Fig. 10. Plots of mean texture error and point position error for models trained on
different subset sizes from the brain data set

Fig. 11. Plots of mean texture error and point position error for models trained on
different subset sizes from the face data set

Figure 13 shows the point-to-curve and texture errors as a function of the number
of model modes retained when building a global model from subsets of 50
brain images and testing on the remainder. The original AAM performance
increases as the number of model modes increases. For the local AAM the allowed
movement of the points was estimated from the residual point position variance 
of the training set not explained by the model modes used. The local AAM gives 
significantly better results for all choices of numbers of modes. The performance 
reaches a peak at around 20 modes, after which it begins to deteriorate slightly, 
probably due to the model becoming over-specific. Clearly graphs of this sort 
can be used to estimate the optimal numbers of modes to use for a model. 
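
The residual point position variance mentioned above can be sketched as follows. The sketch assumes aligned training shapes stored as flat vectors, a mean shape, and orthonormal shape modes. How the allowed movement is derived from this variance is not specified here, so the code stops at the per-point residual. The function name is hypothetical.

```python
def residual_point_variance(shapes, mean, modes):
    """Per-point mean squared residual left after removing the retained modes.

    shapes: list of shape vectors [x0, y0, x1, y1, ...]
    mean:   mean shape vector of the same length
    modes:  list of orthonormal mode vectors (assumed), same length
    """
    n_points = len(mean) // 2
    totals = [0.0] * n_points
    for shape in shapes:
        d = [s - m for s, m in zip(shape, mean)]
        for mode in modes:
            # Remove the component of the deviation explained by this mode.
            coeff = sum(di * mi for di, mi in zip(d, mode))
            d = [di - coeff * mi for di, mi in zip(d, mode)]
        for i in range(n_points):
            totals[i] += d[2 * i] ** 2 + d[2 * i + 1] ** 2
    return [t / len(shapes) for t in totals]

# Toy data: the single mode explains point 0 fully but not point 1.
variances = residual_point_variance(
    shapes=[[1.0, 0.0, 0.0, 1.0], [-1.0, 0.0, 0.0, -1.0]],
    mean=[0.0, 0.0, 0.0, 0.0],
    modes=[[1.0, 0.0, 0.0, 0.0]],
)
```

As more modes are retained the residual shrinks, so the local models are given less freedom.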

Figure 14 shows the results of equivalent experiments for the face model.
Again, these show that the additional local flexibility leads to improvements for 
all choices of number of modes, and that increasing the number of global modes 
beyond about 50 can decrease overall performance. 

7.4 Choice of Parameters for Local AAMs 

There are two significant parameters for the local models: the size of their
support region (the radius of the region sampled about each point, $r$) and the
standard deviation of the Gaussian kernel used to smooth the local displacements,
$\sigma_s$. We performed search experiments with various radii of sampled regions. We
found that for the brain data the results were relatively insensitive to radii in
the range $r = 2..10$ pixels.

[Figure 12 panels: Times for brain search; Times for face search]

Fig. 12. Plots of mean time of search for brain and face experiments

[Figure 13 axes: Number of Model Modes (both panels)]

Fig. 13. Plot of error against number of brain model modes used

Figure 15 shows plots of point-to-boundary error for different values of the
standard deviation of the local displacement smoothing kernel, $\sigma_s$, for different
numbers of training examples. The quality of fit achieved reaches an optimum at
about $\sigma_s = 5$ pixels but again is relatively insensitive to variation in this value.
More smoothing leads to a degradation, as the model is less able to fit to small 
changes. With less smoothing, presumably the model becomes too sensitive to 
noise. 
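
The effect of the smoothing width can be illustrated with a small sketch. It assumes the smoothing is a normalised Gaussian-weighted average of the suggested displacements over all model points, with weights depending on the distance between points. The exact construction of the smoothing matrix is not given in this section, so this is one plausible form only, and the function name is hypothetical.

```python
import math

def smooth_displacements(points, displacements, sigma_s):
    """Gaussian-weighted average of displacements over neighbouring points."""
    smoothed = []
    for px, py in points:
        w_sum = sx = sy = 0.0
        for (qx, qy), (dx, dy) in zip(points, displacements):
            dist_sq = (px - qx) ** 2 + (py - qy) ** 2
            w = math.exp(-dist_sq / (2.0 * sigma_s ** 2))
            w_sum += w
            sx += w * dx
            sy += w * dy
        smoothed.append((sx / w_sum, sy / w_sum))
    return smoothed

points = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
spike = [(0.0, 0.0), (3.0, 0.0), (0.0, 0.0)]
smoothed_spike = smooth_displacements(points, spike, sigma_s=5.0)
```

An isolated displacement is attenuated and partly shared with its neighbours; a larger kernel width spreads it further, which is consistent with heavy smoothing preventing the model from following small changes.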

8 Discussion and Conclusions 

We have demonstrated how the Active Appearance Model can be extended to 
allow smooth local deformation. Such models can fit more accurately to unseen 
data than models using purely global model modes or purely local modes. The 
local AAMs are trained by learning the relationship between point displacements 
and induced image error (rather than relying on image derivatives), and can 
match quickly and accurately. 

[Figure 14 axes: Number of Model Modes (both panels)]

Fig. 14. Plot of error against number of face model modes used

Fig. 15. Plot of positional error against SD of local smoothing, $\sigma_s$



The model points used could be a dense grid, rather than the sparse landmarks
shown above. We have experimented with this approach. We found it to
be slower and no more accurate than using the sparse representation. This is 
because the landmarks have been deliberately chosen to be those which best de- 
scribe the shape, and tend to be on structures which can easily be tracked. Were 
a dense grid used, it would have many points in smooth regions which cannot be 
accurately tracked - they must be interpolated from nearby structured regions. 

We intend to explore using the local AAMs to help automatically build mo- 
dels from sets of images, using a bootstrapping algorithm similar to that descri- 
bed by Jones and Poggio jH). 

Allowing local model deformation means that more parameters are required 
to describe the model state. However, this is justified if a more accurate model 
fit can be achieved. 

The modelling and matching algorithms described above can be considered 
as a combination of both global appearance model type algorithms and pair- 
wise (local-deformation based) image registration algorithms, and should lead 
to more robust and accurate solutions in both problem domains. 

Abstract. This paper describes an extension of a technique for the
recognition and tracking of everyday objects in cluttered scenes. The goal
is to build a system in which ordinary desktop objects serve as physical 
icons in a vision based system for man-machine interaction. In such a 
system, the manipulation of objects replaces user commands. 

A view-variant recognition technique, developed by the second author, 
has been adapted by the first author for a problem of recognising and 
tracking objects on a cluttered background in the presence of occlusions. 
This method is based on sampling a local appearance function at discrete 
viewpoints by projecting it onto a vector of receptive fields which have 
been normalised to local scale and orientation. This paper reports on the 
experimental validation of the approach, and of its extension to the use 
of receptive fields based on colour. The experimental results indicate that 
the second author’s technique does indeed provide a method for building 
a fast and robust recognition technique. Furthermore, the extension to 
coloured receptive fields provides a greater degree of local discrimination 
and an enhanced robustness to variable background conditions. 

The approach is suitable for the recognition of general objects as physical 
icons in an augmented reality. 



Keywords: Object Recognition, Texture & Colour, Appearance-Based Vision, 
Phicons 

1 Introduction 

This article addresses the problem of the recognition of objects with a wide 
variety of features under changing background conditions. The proposed system 
is to be used in the context of an augmented reality system. In this system, 
physical icons (phicons) are used to enhance the man-machine interface. Physical 
icons are physical objects to which a virtual entity can be attached. Such a
virtual entity can represent system commands and their parameters USEI. A 
classical example of the use of phicons is in editing operations. An eraser can
stand for deleting, scissors can stand for cutting, and tape can stand for pasting.
An appropriate selection of phicons allows users to quickly adapt to the graspable
interface. Our problem is to build such a system to investigate the improvement 
in usability provided by phicons. 

In an augmented reality system, one or more cameras observe a region of 
interest in which interaction can take place. Such a region can be a desk or, more
generally, a three dimensional space within a room. In such an environment the
background and the lighting are variable. Translation of objects invokes differences
in the view point of the camera and object pose. These problems require a system 
that is robust to such differences and make the recognition and pose estimation 
of phicons in an augmented reality an interesting challenge for computer vision. 

An important constraint in a phicon based interface is that the user may select
the objects which serve as his personal interface. This imposes the constraint
that the computer vision system cannot be engineered for specific classes of
objects. The system must be completely general. In addition, the computer vision
system must not interfere with natural interaction. Thus the vision system must 
have a very low latency (on the order of 50 milliseconds in the case of tracking), 
and a very low failure rate. 

The acceptance of objects with a wider variety of features increases the dif- 
ficulty of recognition and pose estimation. Although there already exist many 
different approaches, most established methods work well for restricted classes 
of objects. 

In this article an approach is proposed that allows the view-variant recognition
of objects in a desk-top scene observed with a steerable camera. A possible
solution could be provided by colour histograms. However, this approach
is not suitable for pose estimation. The extension to pose estimation in 2D and
3D is an important factor for the design of the approach. For this reason receptive
fields are preferred to colour histograms.

The second author P| has recently demonstrated a technique for the reco- 
gnition of objects over changes in view-point and illumination which is robust 
to occlusions. In this approach, local scale and orientation are estimated at each 
point in an image. A vector of receptive fields is then normalised to this scale and 
orientation. The local neighborhood is projected onto this vector. This provides 
a representation which can be used by a prediction-verification algorithm for 
fast recognition and tracking, independent of scale and image orientation. View
invariant recognition is obtained by sampling this representation at regular in- 
tervals over the view sphere. Because the method uses local receptive fields, it 
is intrinsically robust to occlusions. 

In this article we adapt this technique to the problem of recognising and 
tracking physical icons. The technique is extended by employing coloured receptive
fields. The proposed approach allows the recognition of a wide variety of common
objects, including objects with features that make recognition difficult, such as
specularity and transparency. Evaluation of the experiments shows that good
results are obtained, even in an environment with variable background.

The next section reviews the description of the local appearance function 
by projection onto normalised receptive field vectors. We then describe how
this approach can be extended to coloured receptive fields. We then provide
experimental results which validate the second author's approach using grey
scale images, and then demonstrate the contribution of colour. 

2 Describing Local Appearance 

In 1991 Adelson and Bergen [2| reported a function that derives the basic visual 
elements from structural visual information in the world. This function is called 
the plenoptic function (from “plenus”, full or complete, and “opticus”, to see). 
The plenoptic function is the function of everything that can be seen. In machine 
vision the world is projected onto an image, which is a sample of the plenoptic 
function: 

$$P(x, y, t, \lambda, V_x, V_y, V_z) \qquad (1)$$

where $(x, y)$ are the image coordinates, $t$ the time instant, $\lambda$ the response
wavelength, and $(V_x, V_y, V_z)$ the view point. If the plenoptic function for an object
is known it would be possible to reconstruct every possible image of the object; 
that is from every possible view, at every moment, for every image pixel, at every 
wavelength. 

Adelson and Bergen propose to analyze samples of the plenoptic function 
using low order derivatives as feature detectors. Koenderink jS| expands the 
image signal by the first terms of its Taylor decomposition, that is in terms of 
the derivatives of increasing order. The vector of this set is called “Local Jet”. 
The Local Jet is known to be useful for describing and recognising local features 
P. The signal derivatives are obtained by convolution of the signal by a set of 
basis functions. 



2.1 Gaussian Derivatives 



Gaussian derivatives provide a basis for a Taylor series expansion of a local 
signal. This means that a local image neighborhood can be reconstructed by a 
linear combination of weighted Gaussian derivative filters. This reconstruction 
becomes an approximation which increases in error as the number of filters is 
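reduced. The formula for the $n$-th 1D Gaussian derivative with respect to the
dimension, $x$, is:

$$\delta_x^n g(x, \sigma) = \frac{d^n g(x, \sigma)}{dx^n} = \left(\frac{-1}{\sigma}\right)^n He_n\!\left(\frac{x}{\sigma}\right) g(x, \sigma), \quad \text{with } g(x, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{x^2}{2\sigma^2}} \qquad (2)$$

where $He_n$ stands for the Hermite “type e” polynomials p.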
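
The relation between Gaussian derivatives and Hermite polynomials can be checked numerically. The sketch below assumes the standard normalised Gaussian and the probabilists' ("type e") Hermite polynomials, generated by their usual three-term recurrence. The function names are chosen for illustration.

```python
import math

def hermite_e(n, x):
    """Probabilists' Hermite polynomial He_n(x).

    Uses the recurrence He_{k+1}(x) = x * He_k(x) - k * He_{k-1}(x).
    """
    if n == 0:
        return 1.0
    h_prev, h = 1.0, x
    for k in range(1, n):
        h_prev, h = h, x * h - k * h_prev
    return h

def gaussian(x, sigma):
    """Normalised 1D Gaussian with standard deviation sigma."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)

def gaussian_derivative(n, x, sigma):
    """n-th derivative of the Gaussian with respect to x, in closed form."""
    return (-1.0 / sigma) ** n * hermite_e(n, x / sigma) * gaussian(x, sigma)

# Compare the closed form against a central finite difference.
x, sigma, h = 0.7, 1.5, 1e-5
numeric = (gaussian(x + h, sigma) - gaussian(x - h, sigma)) / (2.0 * h)
analytic = gaussian_derivative(1, x, sigma)
```

Sampling `gaussian_derivative` on a grid of x values gives the 1D filter kernels that are convolved with the signal.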

Gaussian derivatives have an explicit scale parameter, $\sigma$, and can thus be
generated at any scale. With steerable filters proposed by Freeman P Gaussian 
derivatives can be oriented in any arbitrary direction. With automatic scale 
selection |Sj the local scale of a feature can be determined. The object in an image 
can be normalised by scale which allows recognition under scale changes. The 
determination of the dominant orientation of a neighborhood allows normalisation
by orientation. These two properties are used by all techniques presented in this
article. 
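
Normalisation by orientation can be illustrated for first order derivatives, where the response in any direction is a linear combination of the x and y responses. The sketch below assumes the dominant orientation is taken as the gradient direction. That choice, and the function names, are assumptions for illustration.

```python
import math

def steer_first_order(gx, gy, theta):
    """Response of a first order Gaussian derivative steered to angle theta."""
    return math.cos(theta) * gx + math.sin(theta) * gy

def dominant_orientation(gx, gy):
    """Gradient direction, used here as the assumed dominant orientation."""
    return math.atan2(gy, gx)

# Hypothetical filter responses at one image point.
gx, gy = 3.0, 4.0
theta = dominant_orientation(gx, gy)
along = steer_first_order(gx, gy, theta)                   # gradient magnitude
across = steer_first_order(gx, gy, theta + math.pi / 2.0)  # vanishes
```

After steering, the first order responses no longer depend on the rotation of the neighborhood in the image plane.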



3 Sampling Local Appearance 

In the technique proposed by Colin de Verdiere 0 a training set consists of all
overlapping image neighborhoods, referred to as imagettes, of all model images. 
An imagette is projected onto a single point in the descriptor space R. Each 
model image can be represented as a grid of overlapping imagettes. The projec- 
tions of these imagettes form a surface, a local appearance grid, which models 
the local appearance of the image in R (see figure 1).




Fig. 1. An image as a surface in a subspace of R 



Each object is represented by a set of images from different view points. As 
every image results in a local appearance grid, each object is modeled by the set 
of surfaces in R. The recognition process equals the search of the corresponding 
surface for the projection of a newly observed imagette. The basis of all surfaces 
in R are stored in a structural way, so that the searched surface can be obtained 
by table lookup. The resulting surface contains information about the object 
identity, the view point of the camera and information about the relative location 
of the imagette to the object position. The information from several points allows
the pose of the object to be estimated.

The approach based on Gaussian derivatives proposed in 0 serves as benchmark
for the evaluation of the results. This approach is fast due to efficient
storage and recursive filters H3, rotation invariant due to steerable filters ||, 
invariant to scale due to automatic scale selection 0, and robust to occlusions 
due to receptive fields. It produces good results for compact textured objects 
(see section 5.1). The approach fails completely for objects with sparse texture
or objects of small sizes or with holes. The reason is that the Gaussian deri- 
vatives are computed only from the luminance image. In the luminance image 
the structure is very well preserved but the chromatic information is lost, and 
thereby the ability to distinguish objects by their colour. Small or non-compact
objects cannot be recognised because the imagette contains part of the variable
background. If the portion of the background is large, the imagette is
projected onto a different point within the descriptor space. A surface belonging
to another object, or no surface at all, may then be detected.

The approach described in this section serves as a starting point for the de- 
velopment of an improved approach. For the discrimination of poorly structured 
objects, chromatic information is indispensable. In the case of other objects, 
chrominance improves discrimination. A system that employs structural and 
chromatic information describes an additional dimension of the plenoptic func- 
tion. Because this dimension includes more information, it can be expected to 
produce superior recognition results, at the cost of increased computation. Most 
of the additional cost may be avoided by keeping the number of receptive fields 
constant. We compensate for the addition of receptive fields for chrominance with
a reduction in the number of receptive fields for higher order derivatives. Our 
experiments show that chrominance is more effective than third order derivatives 
in discrimination of local neighborhoods. 

4 Coloured Receptive Fields 

A new descriptor space is needed that is based on Gaussian derivatives and capa- 
ble of processing colour images. A direct approach would be to filter each colour 
channel separately. The advantage would be that no information is lost and no 
new technique needs to be developed. The disadvantage is that the normalisation 
process would need to be duplicated independently for each colour channel. 

An alternative is to maintain the use of the luminance channel, and to com- 
plement this with two channels based on chrominance. The chrominance chan- 
nels are described using colour-opponent receptive fields. Luminance is known 
to describe object geometric structure while chrominance is primarily useful for 
discrimination. Thus a receptive field vector is used in which chrominance recep- 
tive fields are normalised with the scale and orientation parameters computed 
from the luminance channel. 

4.1 Selection of an Appropriate Colour Space 

This section addresses the problem of designing the colour opponent receptive 
fields for chrominance. 

The RGB coordinate system is transformed according to the following
transformation

This transformation, illustrated in figure 2, moves the origin to the center of
the colour cube. One axis corresponds to the luminance axis, which will be
used for structure analysis. The other two axes are orthogonal to the luminance
axis and are used for colour analysis. We note that the two axes coding colour
information are sensitive to red-green differences and blue-yellow differences,
inspired by models of the human visual system (Z).




Fig. 2. Transformation of the RGB coordinate system. 



Projection of the image neighborhood onto the luminance axis provides a 
description of geometric structure. Projection onto the colour difference channel 
improves discrimination. 
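
A transformation with the properties described above can be sketched as follows. The coefficients below are illustrative assumptions, not the matrix used in the paper: they are simply one choice that centres the origin in the unit colour cube, places one axis along luminance, and makes the other two axes orthogonal to it with red-green and blue-yellow sensitivity.

```python
def to_opponent(r, g, b):
    """Map RGB in [0, 1] to (luminance, red-green, blue-yellow).

    Illustrative coefficients only; the paper's own matrix is not reproduced.
    """
    # Shift the origin to the centre of the unit colour cube.
    r, g, b = r - 0.5, g - 0.5, b - 0.5
    luminance = (r + g + b) / 3.0     # along the (1, 1, 1) axis
    red_green = (r - g) / 2.0         # orthogonal to luminance
    blue_yellow = (r + g) / 2.0 - b   # orthogonal to luminance
    return luminance, red_green, blue_yellow

grey = to_opponent(0.5, 0.5, 0.5)     # the cube centre maps to the origin
red = to_opponent(1.0, 0.0, 0.0)
```

With any choice of this kind, a pure change in brightness leaves both chrominance channels at zero, which is what allows luminance to carry structure and the other two channels to carry colour.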

5 Experimental Results 

The experiment is based on 8 ordinary objects from an office desktop that are
appropriate to serve as physical icons (shown in figure 3). This set of objects
is used to demonstrate the capability of the approaches to cope with general 
objects, among them objects with difficult features. The set contains textu- 
red and uniform objects, compact objects and objects with holes, specular and 
transparent objects. Some of the objects can be discriminated easily by their 
structure (eraser, sweets box), or by their colour (pen, scissors). Other objects 
exhibit specularities and transparencies which would render most object reco- 
gnition techniques unreliable (tape, pencil sharpener, protractor). Recognition 
of such objects is difficult, because small changes of illumination or background
conditions cause large changes in the appearance of these objects.



D. Hall, V. Colin de Verdiere, and J.L. Crowley

[Figure 3 panels: (0) eraser; (1) pen; (2) scissors; (3) stapler; (4) tape; (5) sharpener; (6) protractor; (7) sweets box]

Fig. 3. Object set used in the experiments.




Fig. 4. Test scenes used in the experiments. 

For imagettes at variable scales, extracted from images of objects in the real
world, object background regions tend to be generally lighter or generally darker
than the object. While clutter can introduce mixed backgrounds, such cases tend
to be rare. In order to assure recognition over a range of backgrounds, we train
models placing our objects on both black background and white background
during training. Figure 3 shows the training images on white background. Sections
5.1, 5.2 and 5.3 apply the technique to images with uniform background; section
5.4 shows results on cluttered background.

The training phase results in separate data structures for purely luminance
based receptive field vectors up to third order, and for a receptive field vector
which includes both luminance and chrominance, but is limited to second
order. A recognition cycle was run on the test images. A set of 15 test images is
used that contain between 2 and 6 different objects of the test set (see Figure 4).
The orientation and the position of the objects in the test images are different
from the orientation and position in the training images. The distance from the
camera is constant and the camera is pointing at the desk. A grid of image
neighborhood locations was selected for evaluation using a step size of 5 pixels
between neighborhoods. At each neighborhood, the local scale and orientation 
are determined. The local neighborhood is then projected onto a vector of recep- 
tive fields which has been normalised to this scale and orientation. The vector 
was then used as an index to generate a list of hypotheses for possible objects 
and image neighborhoods having similar appearance. 

For recognition the hypothesis list of the current test point is evaluated. 
No previous knowledge is used. We point out that the performance of the sy- 
stem can be increased by combining hypotheses with previous knowledge in a 
prediction-verification algorithm. Comparing the recognition results based on 
the hypotheses of one single point only gives more precise information about 
the precision and reliability of the different approaches. The approach can be 
generalised to recognition under different view points by including images from 
sample points along the view sphere in the training set. 

For each neighborhood, the method produces a sorted list of image neigh- 
borhoods from all the trained objects with a similar appearance. Similarity in 
appearance is determined by the distance between the vector of responses to 
the receptive fields. A list of neighborhoods within a tolerance distance (epsilon) 
are returned. This list is sorted by similarity. If the list is too large, then the 
neighborhood is judged to be non-discriminant and is rejected. Similarly, if no 
neighborhoods are found within a tolerance, the neighborhood is judged to be 
unstable, and is rejected. Neighborhoods for which a small number of similar 
matches are found are labeled as “accepted” in the experiments below. 
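
The acceptance rule described above can be sketched as follows. The paper retrieves candidates by indexing into a stored structure; this sketch swaps in a plain linear scan with Euclidean distance, and the parameter `max_matches` (the point at which a list is "too large") is a hypothetical name for a threshold the text does not quantify.

```python
import math

def match_neighborhood(query, database, epsilon, max_matches):
    """Return (status, matches) for one receptive field response vector.

    database: list of (descriptor, label) pairs from the training images.
    matches:  (distance, label) pairs within epsilon, most similar first.
    """
    matches = []
    for descriptor, label in database:
        distance = math.dist(query, descriptor)
        if distance <= epsilon:
            matches.append((distance, label))
    matches.sort()
    if not matches:
        return "unstable", []            # nothing within tolerance
    if len(matches) > max_matches:
        return "non-discriminant", []    # too many similar neighborhoods
    return "accepted", matches

# Hypothetical two-dimensional descriptors.
database = [((0.0, 0.0), "eraser"), ((0.1, 0.0), "eraser"), ((5.0, 5.0), "pen")]
status, matches = match_neighborhood((0.0, 0.05), database, epsilon=0.5, max_matches=3)
```

Only accepted neighborhoods contribute hypotheses. The other two outcomes correspond to the rejections described in the text.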

The recognition rates must be seen in combination with the acceptance rate. 
The goal is to obtain high acceptance rates together with high recognition rates. 
Thus, to evaluate the results of the techniques, three values are presented. First,
the percentage of neighborhoods that produced a hypothesis is displayed. This
value is labeled the “acceptance rate”; it is the percentage of
neighborhoods which are both unambiguous and stable. Secondly,
we display the proportion of neighborhoods for which the most similar recalled
neighborhood is from the correct object. These cases are labeled “1st answer
correct”. A third value presents the proportion of returned neighborhoods for which
the correct object and neighborhood was in the best three returned neighborhoods
(“correct answer among first 3”). Such slightly ambiguous neighborhoods
can be employed by a prediction-verification algorithm for recognition. All
values are average values over the test scenes with uniform background (sections
5.1, 5.2 and 5.3) or over the test scenes with cluttered background (section 5.4).
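
The three reported values can be computed as in the sketch below. It assumes that the two recognition rates are taken relative to the accepted neighborhoods only, which is how the tables read when the acceptance rate is low. The function name and data layout are hypothetical.

```python
def recognition_rates(ranked_labels, true_label):
    """Return (acceptance rate, 1st answer correct, correct among first 3).

    ranked_labels: one list per tested neighborhood, holding the object
    labels of its matches in order of similarity; empty if rejected.
    """
    accepted = [labels for labels in ranked_labels if labels]
    acceptance = len(accepted) / len(ranked_labels)
    if not accepted:
        return acceptance, float("nan"), float("nan")
    first = sum(labels[0] == true_label for labels in accepted) / len(accepted)
    top3 = sum(true_label in labels[:3] for labels in accepted) / len(accepted)
    return acceptance, first, top3

# Hypothetical results for four neighborhoods of object "a".
rates = recognition_rates([["a"], ["b", "a"], [], ["b", "c", "d", "a"]], "a")
```

A high recognition rate is only meaningful when the acceptance rate is also reasonable, since otherwise it rests on very few neighborhoods.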



5.1 Local Appearance Technique Based on Luminance 



This experiment is computed on luminance images according to the technique
described in section 3 using recursive filters, automatic scale selection, and steerable
filters. This experiment is the benchmark for the following experiments.



Table 1. Results of technique based on luminance receptive fields. Neighborhoods of 
objects with discriminant structure are easily recognised. However, luminance provides 
poor discrimination for uniform and specular objects. 



object number                 0     1     2     3     4     5     6     7
acceptance rate               0.30  0.41  0.65  0.04  0.47  0.54  0.07  0.23
1st answer correct            0.40  0.27  0.62  0.59  0.28  0.12  0.91  0.43
correct answer among first 3  0.77  0.51  0.83  0.82  0.62  0.47  1     0.81



Neighborhoods from objects eraser (0), pen (1), scissors (2), tape (4), sharpe- 
ner (5) and sweets box (7) produce good acceptance rates. The acceptance rates 
for neighborhoods from the stapler (3) and protractor (6) are very low, which
indicates that most of the observed neighborhoods are unstable or ambiguous.
The recognition rates for these objects are thus based on an insufficient number 
of windows and should not be considered to judge the accuracy of this particular 
experiment. These two objects are very hard to recognise by a system using only 
luminance. 

Objects eraser (0), scissors (2) and sweets box (7) produce sufficiently high 
recognition rates and a simple voting algorithm could be used for recognition. 
A prediction-verification approach would produce a robust recognition for these 
objects, as reported by Colin de Verdiere. Poor results for recognising
neighborhoods are obtained for objects pen (1), tape (4) and sharpener (5). These
objects are either uniform or specular, which makes the recognition using only 
luminance difficult. 

5.2 Coloured Receptive Field Technique Using 0th Order Gaussian
Derivative in Colour Channels

In this experiment two chrominance channels are added to the receptive field
vector. These two axes, which are orthogonal to the luminance axis, are encoded
with a 0th order Gaussian with size $\sigma$ as determined by local normalisation.
These two channels capture information about the chrominance in the neighbor- 
hood of each point. This provides good recognition rates for structural objects 
in the previous experiment as well as a large improvement in acceptance and 
recognition for the constant and specular objects. 



Table 2. Results of technique extended to 0th order Gaussian derivative in
chrominance channels. High recognition rates are obtained for all objects,
although the acceptance rate for transparent objects remains low.

object number                 0     1     2     3     4     5     6     7
acceptance rate               0.82  0.78  0.86  0.97  0.86  0.87  0.19  0.91
1st answer correct            0.87  0.93  0.79  0.96  0.66  0.74  1     0.99
correct answer among first 3  0.94  0.98  0.91  1     0.88  0.96  1     1



The addition of chrominance information raised the acceptance rates from
an average of 0.34 in the previous experiment to an average of 0.78. Far fewer
neighborhoods are rejected because of ambiguous or unstable structure. Figure 5
illustrates the decrease in the number of ambiguous windows when moving from
grey scale to coloured receptive fields. This is an important improvement
because even for difficult objects many windows produce a result, which was
not the case in the previous experiment. The only object with a low acceptance
rate is the protractor (6), which is transparent and particularly difficult to
describe.

Very good recognition rates are obtained for all objects. The lowest first
answer recognition rates are obtained for objects tape (4) and sharpener (5).
These objects are highly specular and thus change their appearance with pose
and illumination. Even for these objects the recognition rates are sufficiently
high that a simple voting scheme could be used for recognition in restricted
domains.
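
The voting scheme mentioned above is not specified in the text. The following
is a minimal sketch of one possible reading, assuming that every accepted
window votes for the object label it was recognised as and that rejected
windows are marked None. The function name and the sample labels are
hypothetical.

```python
from collections import Counter

def vote(window_labels):
    """Return the label with the most votes among accepted windows.

    Rejected (ambiguous or unstable) windows are represented by None and
    do not vote. Returns None when no window was accepted.
    """
    accepted = [label for label in window_labels if label is not None]
    if not accepted:
        return None
    return Counter(accepted).most_common(1)[0][0]

# Hypothetical per-window results for one test image of the tape (4).
windows = [4, 4, None, 5, 4, None, 4, 1]
winner = vote(windows)
```

With per-window first-answer rates well above those of any competing label, a
majority of this kind can be correct even when individual windows are not.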



5.3 Coloured Receptive Field Technique Using 0th and 1st Order 
Gaussian Derivatives in Colour Channels 

In this experiment the chrominance information is extended to the first
derivatives in order to capture colour gradients that are characteristic for
the object. The structure analysis is performed in the 1st and 2nd order
derivatives. The 3rd order derivative is abandoned, because its analysis is
only interesting when the 2nd derivative is important [5]. The descriptor space
then has 8 dimensions, which helps to avoid the problems that occur in high
dimensional spaces. The comparison of Table 1 and Table 3 shows that the
improvement obtained by using colour far outweighs the loss in structure
recognition caused by abandoning the 3rd order derivative.

Fig. 5. White points mark non-ambiguous (accepted) windows. (a) Accepted
windows for grey scale receptive fields. (b) Accepted windows for coloured
receptive fields.



Table 3. Results of technique extended to 0th and 1st order Gaussian
derivatives in chrominance channels. High recognition rates are obtained for
all objects. Average results are slightly superior to those in Section 5.2.

object number                 0     1     2     3     4     5     6     7
acceptance rate               0.88  0.87  0.91  0.98  0.83  0.98  0.23  0.99
1st answer correct            0.91  0.98  0.86  0.97  0.74  0.77  0.96  1
correct answer among first 3  0.98  0.99  0.94  0.99  0.90  0.97  0.99  1



The acceptance rates are on average higher than in the previous experiment.
The acceptance rate for the protractor (6) is still relatively low. The
recognition rates are slightly superior to those obtained in the previous
experiment. This shows that the colour gradient holds information which
improves the discrimination of objects.

5.4 Experiments on Cluttered Background 

The benchmark technique, which uses only luminance, produces very low
recognition rates (Table 4). We obtain a mean of 0.1238 for the first answers,
which is even worse than guessing. This means that the background introduces
new structures that were not present in the training base. These structures
are so prominent that a correct classification is very difficult.

Table 4. Results for objects on cluttered background using the technique based
on luminance images. Very low recognition rates are observed. Object
recognition is difficult.



object number                 0     1     2     3     4     5     6     7
acceptance rate               0.04  0.52  0.44  0.33  0.59  0.92  0.54  0.25
1st answer correct            0     0.15  0.39  0     0.03  0     0.17  0.25
correct answer among first 3  1     0.42  0.70  0     0.17  0.15  0.27  0.64
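
The statement that the luminance-only benchmark is worse than guessing can be
checked from the first-answer row of Table 4. The short sketch below recomputes
the mean and compares it with the chance level, assuming that guessing means
choosing uniformly among the 8 objects; the variable names are illustrative.

```python
# First-answer recognition rates from Table 4, objects 0..7.
first_answer = [0, 0.15, 0.39, 0, 0.03, 0, 0.17, 0.25]

mean_rate = sum(first_answer) / len(first_answer)

# Assumption: a guess picks one of the 8 objects uniformly at random.
chance = 1 / len(first_answer)

below_chance = mean_rate < chance
```

The mean is 0.99 / 8, which rounds to the reported 0.1238, and it lies just
under the chance level of 0.125.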



Table 5. Results for objects on cluttered background obtained by technique
extended to 0th order Gaussian derivative in colour channels. Few windows are
rejected. Object recognition is possible.

object number                 0     1     2     3     4     5     6     7
acceptance rate               0.65  0.71  0.51  0.80  0.70  0.81  0.53  0.86
1st answer correct            0.91  0.80  0.50  0.34  0.70  0.35  0.29  0.94
correct answer among first 3  0.94  0.90  0.61  0.68  0.87  0.42  0.31  0.98



Table 6. Results for objects on cluttered background with technique extended to
0th and 1st order Gaussian derivatives in colour channels. High acceptance
rates are observed. Object recognition is possible.

object number                 0     1     2     3     4     5     6     7
acceptance rate               0.75  0.60  0.50  0.79  0.60  0.14  0.41  0.96
1st answer correct            0.85  0.82  0.58  0.19  0.77  0.27  0.39  0.99
correct answer among first 3  0.93  0.86  0.64  0.39  0.84  0.55  0.43  0.99

In the case of a cluttered background, the use of chrominance provides a
dramatic improvement in recognition rates. Objects eraser (0), pen (1), tape (4)
and sweets box (7) have high recognition rates together with high acceptance
rates, which allows a reliable classification. There are problems with objects
scissors (2), stapler (3), sharpener (5) and protractor (6), due to either low
acceptance rates or low recognition rates. For transparent objects such as the
protractor (6) this is expected, because their appearance depends very much on
the background conditions. Objects scissors (2), stapler (3) and sharpener (5)
are either small, thin or have holes. This means that a large number of
neighborhoods contain background information which perturbs the classification.

Another interesting observation is that in the case of background clutter,
the acceptance rates using only the 0th order Gaussian in the colour channels
are slightly higher than the acceptance rates obtained by the technique using
the 0th and 1st order derivatives in the colour channels. The cluttered
background contains a large set of colours which are not present in the
training base. This variety leads to colour gradients at boundaries which have
not been observed in training and are thus rejected. Receptive fields based on
the 0th order Gaussian are much less sensitive to such background distraction.
This is interesting because on uniform background the technique using the
colour gradient has been found superior to the technique using only the 0th
order Gaussian.

6 Conclusions 

The results presented in this article are incremental and primarily
experimental. We have experimentally investigated the extension of a
luminance-based receptive field technique to the problem of real time
observation of physical icons for computer human interaction. Certain
characteristics of real world objects, such as specularity, transparency or low
structure, variable background and changing camera positions, make the
identification of objects difficult.

The recognition technique evaluated in this article employs local orientation
normalisation to provide invariance to image plane rotations. Robustness to
scale changes is provided by local normalisation using automatic scale
selection. The technique can be implemented to operate in real time by
recursively computing separable Gaussian filters. Such filters are steered to
the local orientation using the steerability property of Gaussian derivatives.
Training for the grey scale technique took 237 s on a Pentium II 333 MHz. The
two techniques using colour each needed 278 s, for 16 training images with an
average size of 39 212 pixels.
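
The steerability property referred to above can be illustrated for the first
order derivative: the derivative along direction θ is a linear combination of
the x and y derivatives. The sketch below uses directly sampled kernels rather
than the recursive filters of the article, and all names and parameter values
are assumptions made for the illustration.

```python
import numpy as np

def gaussian_derivative_kernels(sigma, radius):
    """Sampled 2-D Gaussian and its first derivatives in x and y."""
    ax = np.arange(-radius, radius + 1, dtype=float)
    x, y = np.meshgrid(ax, ax)
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    gx = -x / sigma**2 * g   # dG/dx
    gy = -y / sigma**2 * g   # dG/dy
    return x, y, g, gx, gy

def steer(gx, gy, theta):
    """First derivative steered to angle theta from the two basis kernels."""
    return np.cos(theta) * gx + np.sin(theta) * gy

sigma, radius, theta = 2.0, 8, np.pi / 6
x, y, g, gx, gy = gaussian_derivative_kernels(sigma, radius)
steered = steer(gx, gy, theta)

# Direct construction: derivative along the axis u = x cos(theta) + y sin(theta).
u = x * np.cos(theta) + y * np.sin(theta)
direct = -u / sigma**2 * g

same = np.allclose(steered, direct)
```

Because only the two basis responses are needed, the filter bank does not have
to be recomputed for every local orientation.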

The benchmark technique produces satisfactory recognition results on uni- 
form background. It can clearly be stated that structured objects exhibit higher 
classification rates. The approach fails for uniform objects, because of a lack 
of structure. A pure luminance based approach also has problems with difficult 
objects, such as transparent or specular objects. Recognition rates for cluttered 
background using only luminance are below chance. 

The method is extended by the addition of chrominance information. A
chrominance descriptor space is presented that can describe colour images and
does not greatly increase the dimensionality in comparison to the starting
point technique. Problems with high dimensional spaces are avoided. A system is
obtained that preserves the advantages of the pure luminance approach and is
capable of classifying a much wider range of objects. It is not significantly
more expensive in computation and storage. The experimental section confirms
that objects with difficult features can be recognised, even on cluttered
background. It also indicates that chrominance is more important to recognition
than higher order derivatives.

We are currently working to extend the approach to view-variant pose esti- 
mation. Recognition under different view points is obtained by including images 
taken under different view points in the training base. The object pose will be 
estimated by using the geometrical information of the results from several re- 
cognised points. A more robust pose estimation will be obtained by using a 
prediction-verification algorithm. The result should be a system with high pre- 
cision, robustness and reliability. 

Abstract. A novel algorithm is proposed to learn pattern similarities
for texture image retrieval. Similar patterns in different texture classes
are grouped into a cluster in the feature space. Each cluster is isolated
from others by an enclosed boundary, which is represented by several
support vectors and their weights obtained from a statistical learning
algorithm called support vector machine (SVM). The signed distance of
a pattern to the boundary is used to measure its similarity. Furthermore,
the patterns of different classes within each cluster are separated by
several sub-boundaries, which are also learned by the SVMs. The signed
distances of the similar patterns to a particular sub-boundary associated
with the query image are used for ranking these patterns. Experimental
results on the Brodatz texture database indicate that the new method
performs significantly better than the traditional Euclidean distance
based approach.

Keywords: Image indexing, learning pattern similarity, boundary
distance metric, support vector machines.



1 Introduction 

Image content based retrieval is emerging as an important research area with
application to digital libraries and multimedia databases. Texture,
as a primitive visual cue, has been studied for over twenty years. Various
techniques have been developed for texture segmentation, classification,
synthesis, and so on. Recently, texture analysis has made a significant
contribution to the area of content based retrieval in large image and video
databases. Using texture as a visual feature, one can query a database to
retrieve similar patterns based on textural properties in the images.

In the conventional approach, the Euclidean or Mahalanobis distances between
the images in the database and the query image are calculated and used
for ranking. The smaller the distance, the more similar the pattern to the
query. But this kind of metric has some problems in practice. The similarity
measure based on the nearest neighbor criterion in the feature space is
unsuitable in many cases. This is particularly true when the image features
correspond to low level image attributes such as texture, color, or shape. This
problem is illustrated in Fig. 1(a), where a number of 2-D features from three
different image clusters are shown. The retrieval results corresponding to
query patterns “a” and “b” are very different. In addition, using Euclidean
distance measures for the nearest neighbor search might retrieve patterns
without any perceptual relevance to the original query pattern.

Fig. 1. (a) Examples of 2-D features of three different clusters: the circles
belong to cluster 1, the balls belong to cluster 2, and the squares belong to
cluster 3. Three points a, b and c are from cluster 1. (b) A nonlinear boundary
separates the examples of cluster 1 from clusters 2 and 3.

In fact, the above problem is classical in pattern recognition, but not much
effort has been made to address these issues in the context of image database
browsing. Ma and Manjunath present a learning based approach to retrieve
similar image patterns. They use the Kohonen feature map to get a coarse
labeling, followed by a fine-tuning process using learning vector quantization.
However, the performance of their learning approach is not good when evaluated
by the average retrieval accuracy (see Fig. 6-2 on page 108 of the cited
source). In addition, there are many parameters to be adjusted heuristically
and carefully for applications.

The similarity measure is the key component of content-based retrieval.
Santini and Jain develop a similarity measure based on fuzzy logic. Puzicha
et al. compare nine image dissimilarity measures empirically, showing that
no measure exhibits the best overall performance and that the choice among
measures depends on the sample distributions. In this paper, we propose a
new metric called boundary distance to measure pattern similarities, which is
insensitive to the sample distributions. The basic idea is that a (non-linear)
boundary separates the samples belonging to a cluster of similar patterns from
the remaining samples. This non-linear boundary encloses all similar patterns.
In Fig. 1(b), a non-linear boundary separates all samples in cluster 1 from the
others belonging to clusters 2 and 3. The signed distances of all samples to
this non-linear boundary are calculated and used to decide the pattern
similarities. This non-linear boundary can be learned from some training
examples before we construct an image database.

G. Guo, S.Z. Li, and K.L. Chan

How can the boundary be learned? We argue that an appropriate similarity
learning algorithm for application in content based image retrieval should have
two properties: 1) good generalization; 2) simple computation. The first is a
common requirement for any learning strategy, while the second is very
important for large image database browsing.

A statistical learning algorithm called support vector machine (SVM)
is used in our learning approach. The foundations of SVM have been developed
by Vapnik. The formulation embodies the Structural Risk Minimization
(SRM) principle, which has been shown to be superior to the traditional
Empirical Risk Minimization (ERM) principle employed by conventional artificial
neural networks. SVMs were developed to solve classification and regression
problems and have recently been used to solve problems in computer
vision, such as 3D object recognition, face detection, and so on.

We adapt the SVMs to solve the image retrieval problem. The major dif- 
ference from the conventional utilization of SVMs is that we use the SVMs to 
learn the boundaries. 

The paper is organized as follows. In Section 2, we describe the basic theory 
of SVM and its use for discriminating between different clusters. In Section 3, 
we present the techniques for learning image pattern similarity and ranking the 
images. Section 4 evaluates the performance of the new method for similar image 
retrieval. Finally, Section 5 gives the conclusions. 

2 Cluster Discrimination by Support Vector Machines 

2.1 Basic Theory of Support Vector Machines 

Consider the problem of separating the set of training vectors belonging to two
separate classes, $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_l, y_l)$, where
$\mathbf{x}_i \in \mathbb{R}^n$ is a feature vector of dimension $n$ and
$y_i \in \{-1, +1\}$, with a hyperplane of equation
$\mathbf{w} \cdot \mathbf{x} + b = 0$. The set of vectors is said to be
optimally separated by the hyperplane if it is separated without error and the
margin is maximal. In Fig. 2(a), there are many possible linear classifiers
that can separate the data, but there is only one (shown in Fig. 2(b)) that
maximizes the margin (the distance between the hyperplane and the nearest data
point of each class). This linear classifier is termed the optimal separating
hyperplane (OSH). Intuitively, we would expect this boundary to generalize well
as opposed to the other possible boundaries shown in Fig. 2(a).

A canonical hyperplane has the following constraint on the parameters
$\mathbf{w}$ and $b$:
$\min_{\mathbf{x}_i} y_i (\mathbf{w} \cdot \mathbf{x}_i + b) = 1$.

A separating hyperplane in canonical form must satisfy the following
constraints:

$$y_i [(\mathbf{w} \cdot \mathbf{x}_i) + b] \geq 1, \quad i = 1, \ldots, l \qquad (1)$$

Fig. 2. Classification between two classes using hyperplanes: (a) arbitrary
hyperplanes l, m and n; (b) the optimal separating hyperplane with the largest
margin identified by the dashed lines, passing the two support vectors.



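The margin is $2 / \|\mathbf{w}\|$ according to its definition. Hence the
hyperplane that optimally separates the data is the one that minimizes

$$\Phi(\mathbf{w}) = \frac{1}{2} \|\mathbf{w}\|^2 \qquad (2)$$

The solution to the optimization problem of (2) under the constraints of (1)
is given by the saddle point of the Lagrange functional,

$$L(\mathbf{w}, b, \alpha) = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^{l} \alpha_i \left\{ y_i [(\mathbf{w} \cdot \mathbf{x}_i) + b] - 1 \right\} \qquad (3)$$

where $\alpha_i$ are the Lagrange multipliers. The Lagrangian has to be
minimized with respect to $\mathbf{w}$, $b$ and maximized with respect to
$\alpha_i \geq 0$. Classical Lagrangian duality enables the primal problem (3)
to be transformed to its dual problem, which is easier to solve. The dual
problem is given by

$$\max_{\alpha} W(\alpha) = \max_{\alpha} \left\{ \min_{\mathbf{w}, b} L(\mathbf{w}, b, \alpha) \right\} \qquad (4)$$

The solution to the dual problem is given by

$$\bar{\alpha} = \arg\min_{\alpha} \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j - \sum_{i=1}^{l} \alpha_i \qquad (5)$$

with constraints $\alpha_i \geq 0$, $i = 1, \ldots, l$, and
$\sum_{j=1}^{l} \alpha_j y_j = 0$.

Solving Equation (5) with these constraints determines the Lagrange
multipliers, and the OSH is given by

$$\bar{\mathbf{w}} = \sum_{i=1}^{l} \bar{\alpha}_i y_i \mathbf{x}_i, \qquad \bar{b} = -\frac{1}{2} \bar{\mathbf{w}} \cdot [\mathbf{x}_r + \mathbf{x}_s] \qquad (6)$$

where $\mathbf{x}_r$ and $\mathbf{x}_s$ are support vectors with
$\bar{\alpha}_r, \bar{\alpha}_s > 0$, $y_r = 1$, $y_s = -1$.

For a new data point $\mathbf{x}$, the classification is then

$$f(\mathbf{x}) = \mathrm{sign}(\bar{\mathbf{w}} \cdot \mathbf{x} + \bar{b}) = \mathrm{sign}\left( \sum_{i=1}^{l} \bar{\alpha}_i y_i (\mathbf{x}_i \cdot \mathbf{x}) + \bar{b} \right) \qquad (7)$$

To generalize the OSH to the non-separable case, slack variables $\xi_i$ are
introduced [2]. Hence the constraints of (1) are modified as

$$y_i [(\mathbf{w} \cdot \mathbf{x}_i) + b] \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, l \qquad (8)$$

The generalized OSH is determined by minimizing

$$\Phi(\mathbf{w}, \xi) = \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{l} \xi_i \qquad (9)$$

(where $C$ is a given value) subject to the constraints of (8).

This optimization problem can also be transformed to its dual problem, and
the solution is the same as (5), but with the constraints on the Lagrange
multipliers changed to $0 \leq \alpha_i \leq C$, $i = 1, \ldots, l$.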
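
As a small worked example of the separable case, consider a training set with
a single point per class, for which the dual problem can be solved by hand.
The sketch below is an illustration only: the two points are made up, and the
closed-form multiplier holds for this two-point case, not in general.

```python
import numpy as np

# Two training points, one per class (illustrative values, not from the paper).
x_r, y_r = np.array([1.0, 1.0]), +1
x_s, y_s = np.array([-1.0, -1.0]), -1

# With one point per class, the constraint sum_i alpha_i y_i = 0 forces
# alpha_r = alpha_s = a. The dual objective then reduces to
# 0.5 * a^2 * ||x_r - x_s||^2 - 2a, which is minimised at a = 2 / ||x_r - x_s||^2.
a = 2.0 / np.sum((x_r - x_s) ** 2)

# Optimal separating hyperplane from the multipliers and the two support vectors.
w = a * y_r * x_r + a * y_s * x_s
b = -0.5 * np.dot(w, x_r + x_s)

# Margin of the canonical hyperplane.
margin = 2.0 / np.linalg.norm(w)

def classify(x):
    """Sign of the decision function for a new point x."""
    return int(np.sign(np.dot(w, x) + b))
```

Here both points are support vectors, they satisfy the canonical constraint
with equality, and the margin equals the distance between them.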

2.2 Non-linear Mapping by Kernel Functions 

In the case where a linear boundary is inappropriate, the SVM can map the
input vector, $\mathbf{x}$, into a high dimensional feature space,
$\mathbf{z}$. The SVM constructs an optimal separating hyperplane in this
higher dimensional space. In Fig. 3, the samples in the input space cannot be
separated by any linear hyperplane, but can be linearly separated in the
non-linearly mapped feature space. Note that the feature space in SVMs is
different from our texture feature space. According to the Mercer theorem,
there is no need to compute this mapping explicitly; the only requirement is to
replace the inner product $(\mathbf{x}_i \cdot \mathbf{x}_j)$ in the input
space with a kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$ to perform the
non-linear mapping. This provides a way to address the curse of dimensionality.

There are three typical kernel functions.

Polynomial

$$K(\mathbf{x}, \mathbf{y}) = ((\mathbf{x} \cdot \mathbf{y}) + 1)^d \qquad (10)$$

where the parameter $d$ is the degree of the polynomial.

Gaussian Radial Basis Function

$$K(\mathbf{x}, \mathbf{y}) = \exp\left( -\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2 \sigma^2} \right) \qquad (11)$$

where the parameter $\sigma$ is the width of the Gaussian function.

Multi-Layer Perceptron

$$K(\mathbf{x}, \mathbf{y}) = \tanh(\mathrm{scale} \cdot (\mathbf{x} \cdot \mathbf{y}) - \mathrm{offset}) \qquad (12)$$

where the scale and offset are two given parameters.

For a given kernel function, the classifier is thus given by

$$f(\mathbf{x}) = \mathrm{sign}\left( \sum_{i=1}^{l} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \right) \qquad (13)$$

Fig. 3. The feature space is related to the input space via a nonlinear map,
causing the decision surface to be nonlinear in the input space (panels: input
space, feature space). By using a nonlinear kernel function, there is no need
to do the mapping explicitly.
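
The three kernels and the kernel classifier can be written down directly. The
sketch below is a plain NumPy illustration: the function names, default
parameter values, support vectors and multipliers are assumptions chosen for
the example, and the Gaussian kernel uses the common form with $2\sigma^2$ in
the denominator.

```python
import numpy as np

def polynomial_kernel(x, y, d=2):
    """Polynomial kernel of degree d."""
    return (np.dot(x, y) + 1.0) ** d

def rbf_kernel(x, y, sigma=0.3):
    """Gaussian radial basis function kernel of width sigma."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def mlp_kernel(x, y, scale=1.0, offset=0.0):
    """Multi-layer perceptron (sigmoid) kernel."""
    return np.tanh(scale * np.dot(x, y) - offset)

def decision_value(x, support_vectors, alphas, labels, b, kernel):
    """Weighted kernel sum over the support vectors plus the bias."""
    return sum(a * y * kernel(sv, x)
               for sv, a, y in zip(support_vectors, alphas, labels)) + b

def classify(x, support_vectors, alphas, labels, b, kernel):
    """Class label (+1 or -1) from the sign of the decision value."""
    value = decision_value(x, support_vectors, alphas, labels, b, kernel)
    return 1 if value >= 0 else -1

# Hypothetical trained boundary: one support vector per class.
svs = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
alphas, labels, bias = [1.0, 1.0], [1, -1], 0.0

near_positive = classify(np.array([0.1, 0.0]), svs, alphas, labels, bias, rbf_kernel)
near_negative = classify(np.array([0.9, 1.0]), svs, alphas, labels, bias, rbf_kernel)
```

Only kernel evaluations against the support vectors are needed at query time,
which is what keeps the classifier cheap when few support vectors are retained.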



2.3 Discrimination between Multiple Clusters 

The previous subsections describe the basic theory of SVM for two-class
classification. For image retrieval in a database of multiple image clusters,
for instance c clusters, we can construct c decision boundaries. Note that
a cluster may contain more than one image class. The perceptually similar
images in different classes are considered as one cluster. Each boundary is
used to discriminate between the images of one cluster and all the remaining
images belonging to other clusters. The images belonging to cluster k are
enclosed by the k-th boundary. This is a one-against-all strategy in the
context of pattern recognition. To our knowledge, only the recently proposed
support vector machines can be used to obtain the boundary optimally by
quadratic programming.

3 Similarity Measure and Ranking



The basic idea of learning similarity is to partition the original feature space into 
clusters of visually similar patterns. 

The pair $(\mathbf{w}, b)$ defines a separating hyperplane or boundary of
equation

$$\mathbf{w} \cdot \mathbf{x} + b = 0 \qquad (14)$$

By kernel mapping, the boundary $(\mathbf{w}, b, K)$ is

$$\sum_{j=1}^{m} \alpha_j^* y_j K(\mathbf{x}_j^*, \mathbf{x}) + b^* = 0 \qquad (15)$$

where $\mathbf{x}_j^*$ $(j = 1, \ldots, m)$ are support vectors, $\alpha_j^*$
are the linear combination coefficients or weights, and $b^*$ is a constant.
Usually $m < l$, i.e., the number of support vectors is less than the number of
training examples.

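Definition 1 (signed distance):

The signed distance $D(\mathbf{x}_i; \mathbf{w}, b)$ of a point $\mathbf{x}_i$
from the boundary $(\mathbf{w}, b)$ is given by

$$D(\mathbf{x}_i; \mathbf{w}, b) = \frac{\mathbf{w} \cdot \mathbf{x}_i + b}{\|\mathbf{w}\|} \qquad (16)$$

Definition 2 (signed distance with kernel):

The signed distance $D(\mathbf{x}_i; \mathbf{w}, b, K)$ of a point
$\mathbf{x}_i$ from the boundary $(\mathbf{w}, b, K)$ with kernel function
$K(\cdot, \cdot)$ is given by

$$D(\mathbf{x}_i; \mathbf{w}, b, K) = \frac{\sum_{j=1}^{m} \alpha_j^* y_j K(\mathbf{x}_j^*, \mathbf{x}_i) + b^*}{\left\| \sum_{j=1}^{m} \alpha_j^* y_j \mathbf{x}_j^* \right\|} \qquad (17)$$

Combining Definitions 1 and 2 with equation (1), we have

$$y_i D(\mathbf{x}_i; \mathbf{w}, b, K) \geq \frac{1}{\left\| \sum_{j=1}^{m} \alpha_j^* y_j \mathbf{x}_j^* \right\|} \qquad (18)$$

for each sample $\mathbf{x}_i$, with $y_i = \pm 1$, $i = 1, \ldots, l$.
Therefore, we have:

Corollary: The lower bound of the signed distances of the positive examples
($y_i = 1$) to the boundary $(\mathbf{w}, b, K)$ is
$1 / \| \sum_{j=1}^{m} \alpha_j^* y_j \mathbf{x}_j^* \|$; the upper bound for
the negative examples ($y_i = -1$) is
$-1 / \| \sum_{j=1}^{m} \alpha_j^* y_j \mathbf{x}_j^* \|$. The boundary
$(\mathbf{w}, b, K)$ lies between these two bounds.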
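
The two signed distance definitions can be compared numerically. The sketch
below, which assumes a linear kernel and a made-up two-point boundary, checks
that the kernel form reduces to the plain hyperplane form and that the support
vectors sit exactly on the bounds stated in the corollary. All names and values
are illustrative.

```python
import numpy as np

def signed_distance_linear(x, w, b):
    """Signed distance of x from the hyperplane w . x + b = 0."""
    return (np.dot(w, x) + b) / np.linalg.norm(w)

def signed_distance_kernel(x, support_vectors, alphas, labels, b, kernel):
    """Signed distance of x from a kernel boundary given by support vectors."""
    svs = np.asarray(support_vectors, dtype=float)
    coef = np.asarray(alphas, dtype=float) * np.asarray(labels, dtype=float)
    numerator = sum(c * kernel(sv, x) for c, sv in zip(coef, svs)) + b
    denominator = np.linalg.norm(coef @ svs)   # norm of sum_j alpha_j y_j x_j
    return numerator / denominator

def linear_kernel(u, v):
    return float(np.dot(u, v))

# Hypothetical boundary: one support vector per class.
svs = np.array([[1.0, 1.0], [-1.0, -1.0]])
alphas, labels, bias = np.array([0.25, 0.25]), np.array([1, -1]), 0.0
w = (alphas * labels) @ svs                    # here (0.5, 0.5)

x = np.array([2.0, 0.0])
d_kernel = signed_distance_kernel(x, svs, alphas, labels, bias, linear_kernel)
d_linear = signed_distance_linear(x, w, bias)
bound = 1.0 / np.linalg.norm(w)
```

With a non-linear kernel only the numerator changes, so the sign of the
distance still tells on which side of the boundary a pattern lies.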

In our cluster-based similarity measure, the perceptually similar patterns
are grouped into one cluster and labeled as positive examples, while the other
patterns are treated as being dissimilar to this cluster. Thus we give:

Definition 3 (similarity measure):

The patterns $\mathbf{x}_i$, $i = 1, \ldots, l$, are said to be perceptually
similar if they are considered as positive samples ($y_i = 1$) and hence
located inside the boundary $(\mathbf{w}, b, K)$; the samples outside are said
to be dissimilar to the patterns enclosed by the boundary.

In the case of c clusters, we have c boundaries.

Definition 4 (signed distance to the $k$th boundary):

If the boundary separating cluster $k$ from the others is
$(\mathbf{w}_k, b_k, K)$, the signed distance of a pattern $\mathbf{x}_i$ to
the $k$th boundary is

$$D(\mathbf{x}_i; \mathbf{w}_k, b_k, K) = \frac{\sum_{j=1}^{k_m} \alpha_{kj}^* y_{kj} K(\mathbf{x}_{kj}^*, \mathbf{x}_i) + b_k^*}{\left\| \sum_{j=1}^{k_m} \alpha_{kj}^* y_{kj} \mathbf{x}_{kj}^* \right\|} \qquad (19)$$

where $\mathbf{x}_{kj}^*$ $(j = 1, \ldots, k_m)$ are the support vectors,
$\alpha_{kj}^*$ are the optimal Lagrange multipliers for the $k$th boundary,
and $b_k^*$ are some constants, $k = 1, \ldots, c$.

Equation (19) is used to calculate the signed distances of patterns to the
$k$th boundary. The pattern similarities (dissimilarities) are measured by
Definition 3.

How are the c boundaries connected to each pattern in a database? This is
realized as follows: when an image pattern $\mathbf{x}_i$ is ingested into the
database during its construction, the signed distances of $\mathbf{x}_i$ to the
stored c boundaries are first calculated by equation (19), and then the index
$k^*$ is selected by

$$k^* = \arg\max_{1 \leq k \leq c} D(\mathbf{x}_i; \mathbf{w}_k, b_k, K) \qquad (20)$$

The index $k^*$ is therefore connected to the image pattern $\mathbf{x}_i$.
Basically, this is a classification problem. Each pattern in the database will
be associated with a boundary index.

In retrieval, when a query image pattern is given, the boundary index $k^*$
connected to the query pattern is first found. Then, we use equation (19) to
calculate the signed distances of all samples to the $k^*$th boundary.
According to Definition 3, the images in the database with positive distances
to the $k^*$th boundary are considered similar. Thus we obtain the image
patterns similar to the query.

How are these similar images ranked? The similar images obtained above belong
to different texture classes. To rank these images, the class information
should be taken into consideration. Assume that cluster $k$
$(k = 1, \ldots, c)$ contains $q_k$ texture classes; the feature space of
cluster $k$ is then further divided into $q_k$ subspaces. Each subspace is
enclosed by a sub-boundary containing the patterns of the same class. Thus, we
partition the feature space in a hierarchical manner: at the higher level, the
database is divided into c clusters, each containing perceptually similar
patterns; at the lower level, each cluster $k$ is further divided into $q_k$
texture classes. If the query image pattern is located inside the sub-boundary
$q^*$, the signed distances to the sub-boundary $q^*$ of all image patterns
enclosed by the boundary $k^*$ are used for ranking.

In summary, each (query) image is associated with two-level boundary indexes.
The images selected by the higher level boundary are ranked by their signed
distances to the lower level boundary.
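
The two-level retrieval described above can be sketched as follows. This is a
simplified illustration, not the implementation used in the experiments: the
boundaries here are hypothetical linear hyperplanes instead of learned kernel
boundaries, and the sub-boundary of the query is chosen as the one with the
largest signed distance, which is an assumption.

```python
import numpy as np

def signed_distance(x, w, b):
    """Signed distance of x from the hyperplane w . x + b = 0."""
    return (np.dot(w, x) + b) / np.linalg.norm(w)

def best_index(x, boundaries):
    """Index of the boundary with the largest signed distance to x."""
    return int(np.argmax([signed_distance(x, w, b) for w, b in boundaries]))

def retrieve(query, database, cluster_boundaries, sub_boundaries):
    """Return database indices similar to the query, best match first."""
    k = best_index(query, cluster_boundaries)        # higher level index k*
    q = best_index(query, sub_boundaries[k])         # lower level index q*
    w_k, b_k = cluster_boundaries[k]
    w_q, b_q = sub_boundaries[k][q]
    # Keep the patterns with positive distance to the k*-th boundary.
    similar = [i for i, x in enumerate(database)
               if signed_distance(x, w_k, b_k) > 0]
    # Rank them by signed distance to the query's sub-boundary.
    return sorted(similar,
                  key=lambda i: signed_distance(database[i], w_q, b_q),
                  reverse=True)

# Hypothetical 2-D example: two clusters split on the first coordinate,
# each with two classes split on the second coordinate.
cluster_boundaries = [(np.array([1.0, 0.0]), 0.0), (np.array([-1.0, 0.0]), 0.0)]
halves = [(np.array([0.0, 1.0]), 0.0), (np.array([0.0, -1.0]), 0.0)]
sub_boundaries = [halves, halves]
database = [np.array(p) for p in [(3.0, -1.0), (1.0, 2.0), (-1.0, 1.0), (2.0, 0.5)]]

ranking = retrieve(np.array([1.0, 1.0]), database, cluster_boundaries, sub_boundaries)
```

The third database pattern is excluded because it lies outside the query's
cluster, and the remaining ones are ordered by how far inside the query's class
region they fall.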

The hierarchical approach to texture image retrieval has two advantages: one
is to retrieve the perceptually similar patterns in the top matches; the other
is to speed up the retrieval process further. Note that there is no need to
compute the Euclidean distance between two points.



4 Image Retrieval Performance 

The Brodatz texture database used in the experiments consists of 112 texture
classes. Each of the 512 x 512 images is divided into 49 sub-images (with
overlap), which are 128 x 128 pixels, centered on a 7 x 7 grid over the
original image. The first 33 sub-images are used as the training set and the
last 16 for retrieval. Thus we create a database of 3696 texture images for
learning, and a database of 1792 texture images to evaluate the retrieval
performance. A query image is one of the 1792 images in the database.
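
The sub-image layout and the resulting database sizes can be reproduced with a
few lines. The sketch assumes that the 7 x 7 grid is evenly spaced (a stride of
64 pixels) and that sub-images are numbered row by row; neither detail is
stated in the text, and the function name is illustrative.

```python
import numpy as np

def split_into_subimages(image, size=128, grid=7):
    """Cut overlapping size x size windows on an evenly spaced grid x grid layout."""
    height, width = image.shape
    ys = np.linspace(0, height - size, grid).astype(int)
    xs = np.linspace(0, width - size, grid).astype(int)
    return [image[y:y + size, x:x + size] for y in ys for x in xs]

image = np.zeros((512, 512))          # stand-in for one Brodatz texture image
subimages = split_into_subimages(image)

train, test = subimages[:33], subimages[33:]   # first 33 / last 16

n_classes = 112
n_train_total = n_classes * len(train)
n_test_total = n_classes * len(test)
```

Since each class contributes 16 retrieval images, every query has 15 other
images of the same class in the retrieval database.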

In this paper we use a Gabor filter bank similar to one derived in earlier
work, with four scales and six orientations. Applying these Gabor filters to
an image results in 24 filtered images. The mean and standard deviation of each
filtered image are calculated and taken as a feature vector

$$f = [\mu_{00}, \mu_{01}, \ldots, \mu_{35}, \sigma_{00}, \ldots, \sigma_{35}] \qquad (21)$$

where the subscripts represent the scale ($s = 0, \ldots, 3$) and orientation
($o = 0, \ldots, 5$). The dimension of the feature vector is thus 48.
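
Given the 24 filtered images, assembling the feature vector is a matter of
stacking means and standard deviations in scale-major order. The sketch below
uses random arrays as a stand-in for the Gabor filter outputs, since the filter
design itself is not reproduced here; the function name is an assumption.

```python
import numpy as np

def texture_feature(filtered):
    """Build the 48-D feature from an array of shape (4, 6, H, W).

    filtered[s, o] is the response magnitude at scale s and orientation o.
    The result lists the 24 means first, then the 24 standard deviations.
    """
    means = filtered.mean(axis=(2, 3)).ravel()
    stds = filtered.std(axis=(2, 3)).ravel()
    return np.concatenate([means, stds])

rng = np.random.default_rng(0)
responses = rng.random((4, 6, 128, 128))   # stand-in for Gabor filter outputs
feature = texture_feature(responses)
```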

The 112 texture image classes are grouped into 32 clusters, each containing 
1 to 8 similar textures. This classification was done manually and Table 1 shows 
the various clusters and their corresponding texture classes. Note that we use 
all the 112 texture classes. 

In the learning stage, we use the Gaussian RBF kernel function with the
parameter σ = 0.3 and C = 200. Figure 4 illustrates an evaluation based on the
average retrieval accuracy, defined as the average percentage of patterns
belonging to the same image class as the query pattern in the top 15 matches.
The comparison is between our hierarchical approach to learning similarity
and ranking and that based on the Euclidean distance measure. Note the
significantly better result achieved by our method. This figure demonstrates
that our hierarchical retrieval gives better results than the traditional
Euclidean distance based approach. Note that the learning approach of Ma and
Manjunath gives nearly the same result as that based on Euclidean distance
(Fig. 6-2 on page 108 of the cited source).
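
The average retrieval accuracy defined above can be computed as in the
following sketch. The function names and the two sample rankings are
hypothetical; they only illustrate the definition and are not results from the
experiments.

```python
def retrieval_accuracy(query_class, ranked_classes, top=15):
    """Percentage of the top matches that share the query's texture class."""
    hits = sum(1 for c in ranked_classes[:top] if c == query_class)
    return 100.0 * hits / top

def average_retrieval_accuracy(results, top=15):
    """Mean accuracy over (query_class, ranked_classes) pairs."""
    return sum(retrieval_accuracy(q, r, top) for q, r in results) / len(results)

# Hypothetical rankings for two queries (class labels of the returned images).
results = [
    ("d001", ["d001"] * 12 + ["d006"] * 3),
    ("d006", ["d006"] * 15),
]
average = average_retrieval_accuracy(results)
```

This measure only counts exact class matches, so a retrieved image from a
visually similar class in the same cluster is scored as a miss.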

Since the average retrieval accuracy does not consider the perceptual
similarity, another evaluation is done, based on the 32 clusters instead of
just the top 15 matches. Figure 5 illustrates the second evaluation result.
Here the differences are quite striking. The performance without learning
deteriorates rapidly after the first 10 to 15 top matches, whereas the
retrievals based on our learned similarity perform consistently well.

Figures 6 and 7 show some retrieval examples, which clearly demonstrate the
superiority of our learning approach. Another important point is the
hierarchical retrieval structure, which speeds up the search process.

Fig. 4. The retrieval performance in terms of obtaining the 15 correct
patterns from the same texture class as the top matches. The performance
improves explicitly on the small number of top matches considered. Note that
this evaluation does not refer to the visual similarity.

Fig. 5. The retrieval performance comparison between our learning similarity
approach and that based on Euclidean distance. The evaluation takes the
perceptual similarity into consideration. Notice the significant improvement in
our retrieval with learning similarity.

Table 1. Texture clusters used in learning similarity. The visual similarity
within each cluster is identified by the people in our research group.

Cluster  Texture Class                            Cluster  Texture Class
1        d001 d006 d014 d020 d049                 17       d069 d071 d072 d093
2        d008 d056 d064 d065                      18       d004 d029 d057 d092
3        d034 d052 d103 d104                      19       d039 d040 d041 d042
4        d018 d046 d047                           20       d003 d010 d022 d035 d036 d087
5        d011 d016 d017                           21       d048 d090 d091 d100
6        d021 d055 d084                           22       d043 d044 d045
7        d053 d077 d078 d079                      23       d019 d082 d083 d085
8        d005 d033 d032                           24       d066 d067 d074 d075
9        d023 d027 d028 d030 d054 d098 d031 d099  25       d101 d102
10       d007 d058 d060                           26       d002 d073 d111 d112
11       d059 d061 d063                           27       d086
12       d062 d088 d089                           28       d037 d038
13       d024 d080 d081 d105 d106                 29       d009 d109 d110
14       d050 d051 d068 d070 d076                 30       d107 d108
15       d025 d026 d096                           31       d012 d013
16       d094 d095                                32       d015 d097

5 Conclusions 

We have presented a new algorithm to learn pattern similarity for texture image 
retrieval. The similar patterns are grouped into a cluster in the feature space. 
The boundaries isolating each cluster with others can be learned efficiently by 
support vector machines (SVMs). Similarity measure and ranking are based on 
the signed distances to the boundaries, which can be simply computed. The 
performance of similar pattern retrieval is significantly improved as compared to 
the traditional Euclidean distance based approach. 
(a) Euclidean distance measure    (b) learning similarity

Fig. 6. Image retrieval comparison. Each query image has 15 other similar images in
the database. The query image (d065-01) is shown at the top left in each case. Note
the degradation in visual similarity in the case of the Euclidean distance measure.
The images are ordered according to decreasing similarity from left to right and top
to bottom. In the case of learning similarity, the performance continues without any
marked degradation in perceptual similarity, even after 60 images are retrieved.
(a) Euclidean distance measure    (b) learning similarity

Fig. 7. Image retrieval comparison. Each query image has 15 other similar images in
the database. The query image (d035-01) is shown at the top left in each case. Note
the degradation in visual similarity in the case of the Euclidean distance measure. In the
case of learning similarity, the performance continues without any marked degradation
in perceptual similarity, even after 90 images are retrieved.
Abstract. We present a simple procedure for synthesising novel views,
using two or more basis-images as input. It is possible for the user to
interactively adjust the viewpoint, and for the corresponding image to be
computed and rendered in real-time. Rather than employing a 3D model,
our method is based on the linear relations which exist between images
taken with an affine camera. We show how the ‘combination of views’
proposed by Ullman and Basri can be appropriately parameterised
when a sequence of five or more images is available. This is achieved by
fitting polynomial models to the coefficients of the combination, where
the latter are functions of the (unknown) camera parameters. We discuss
an alternative approach, direct image-interpolation, and argue that
our method is preferable when there is a large difference in orientation
between the original gaze directions. We show the results of applying
the parameterisation to a fixating camera, using both simulated and real
input. Our observations are relevant to several applications, including
visualisation, animation, and low-bandwidth communication.



1 Introduction 

The World Wide Web presents many novel opportunities for the display of
information from remote sources. For example, consider the possibility of a virtual
museum, in which visitors can interactively inspect the exhibits. There are two
particularly important factors in the design of software for such applications:
Firstly, the visual appeal of the experience may be more important than the
veridicality of the display. Secondly, the widespread adoption of a visualisation
method would depend on the flexibility of the data-capture process.

These needs are well served by image-based approaches, as opposed to
the acquisition and rendering of 3D models. Ideally, we would like to use a small
number of input images to specify a scene, within which we can manipulate a
‘virtual viewpoint’. Although this question has been addressed before, we aim
to present a method which is more general than direct image-interpolation,
while being less complicated than tensor-reprojection.

Our approach is based on the linear combination of views theory, which was
originally proposed by Ullman and Basri in the context of object-recognition.
As has been noted elsewhere, the orthographic camera employed by Ullman
and Basri provides a point of contact with certain methods of visual motion
estimation. In particular, Tomasi and Kanade have shown that the
structure-from-motion problem can be defined by assembling the feature
coordinates from each available image into a single matrix of measurements. The
implicit 3D structure and motion can then be recovered as a particular
factorisation of this joint-image.

These considerations are important, because the joint-image approach
extends naturally to the treatment of any number of views, obtained under
perspective projection. However, (triplets of) perspective views are related by
trilinear equations, rather than by factorisation or direct combination. We
emphasise that the present work is restricted to an affine imaging model, which
admits of a more straightforward treatment.



2 Linear Combinations of Views 



Consider a series of images, $I_t$, taken with an affine camera $C$ at ‘times’
$t = 1, 2, \ldots, T$. If we allow the position, orientation and intrinsic parameters
of the camera to vary from one picture to the next, we can describe the $T$ different
projections of a particular feature as

\[
\begin{bmatrix} x_t \\ y_t \end{bmatrix}
= C_t \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix},
\qquad t = 1, 2, \ldots, T,
\tag{1}
\]

where $[x_t\ y_t]^\top$ are the coordinates of the scene-point $[x\ y\ z\ 1]^\top$, as it
appears in the $t$-th image. In accordance with the affine imaging-model, $C_t$ is a
$2 \times 4$ matrix.

Given the $T$ images, Ullman and Basri have shown that it is also possible
to obtain $[x_t\ y_t]^\top$ without direct reference to the 3D scene. This follows from the
fact that the geometry of an affine view can be represented as a linear
combination of ‘1½’ other affine views. In practice, we chose to employ an overcomplete
representation, which allows for a symmetric treatment of the basis images [2].
For convenience we will define the basis images as $I' = I_1$ and $I'' = I_T$, although
we could in principle make different choices and, likewise, we could employ more
than two basis images. The coordinates of a particular feature in the target
image, $I_t$, are then expressed as

\[
\begin{aligned}
x_t &= a_0 + a_1 x' + a_2 y' + a_3 x'' + a_4 y'', \\
y_t &= b_0 + b_1 x' + b_2 y' + b_3 x'' + b_4 y''.
\end{aligned}
\tag{2}
\]

We can form these equations for every feature in $I_t$, and subsequently obtain
the coefficients $a_n$ and $b_n$ expressing the least-squares estimate of the target
geometry. However, it should be noted that the numerical rank of this system
is dependent on the scene, the imaging model, and the relationship between the
camera matrices $C_t$, $C'$ and $C''$ (corresponding to $I_t$, $I'$ and $I''$, respectively).
For example, if $I_t$ and $I'$ are related by an affine transformation of the
image-plane, then $I''$ is redundant. For this reason, the matrix pseudoinverse is used
to obtain stable estimates of the coefficients $a_n$ and $b_n$.

Several other issues must be addressed before equation (2) can be used as a
practical view-synthesis procedure. Firstly, we must identify five or more
corresponding features in the $T$ images; this process was performed manually in
the present study. Secondly, we must derive a means of rendering the synthetic
image, while resolving those features in $I'$ and $I''$ which are occluded from the
new viewpoint. These problems will be considered in §5.

A more serious limitation is that the coefficients $a_n$ and $b_n$ cannot be
estimated without reference to the ‘target’ coordinates $[x_t\ y_t]^\top$. In other words,
equation (2) only allows us to synthesise those pictures which we have already
taken.

As outlined in the introduction, we would rather regard the $T$ existing images
as samples taken from a continuous motion of the camera. Our aim, then, is to
generate a convincing movie, $I(\tau)$, where certain values of the parameter $\tau$ will
yield the original basis images, while intermediate values will yield plausible
intermediate views.



3 Image Interpolation 



To formalise the definition of $I(\tau)$, suppose that $0 \le \tau \le 1$. For consistency
with the two basis images, we impose the following conditions:

\[
\tau = 0 \;\Rightarrow\; \begin{cases} x(\tau) = x' \\ y(\tau) = y' \end{cases}
\qquad
\tau = 1 \;\Rightarrow\; \begin{cases} x(\tau) = x'' \\ y(\tau) = y'' \end{cases}
\tag{3}
\]

These requirements can be satisfied by a linear interpolation procedure,
performed directly in the image domain:

\[
\begin{bmatrix} x(\tau) \\ y(\tau) \end{bmatrix}
= (1 - \tau) \begin{bmatrix} x' \\ y' \end{bmatrix}
+ \tau \begin{bmatrix} x'' \\ y'' \end{bmatrix}.
\tag{4}
\]

The problem with such a scheme is that the intermediate images are not at all
constrained by the conditions imposed via the basis views. Moreover, equation
(4) is by no means guaranteed to produce a convincing movie. For example,
consider what happens when the two cameras, $C'$ and $C''$, differ by a rotation
of 180° around the line of sight; as $\tau$ is varied, every feature will travel linearly
through the centre of the image. Consequently, at frame $I(\tfrac{1}{2})$, the image collapses
to a point.
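
The collapse described above is easy to reproduce numerically. The following minimal sketch applies the linear interpolation of equation (4) to a few made-up control points. It assumes that image coordinates are measured from the image centre, so that a 180° rotation about the line of sight simply negates each point; the function names and sample coordinates are illustrative only.

```python
def interpolate(p0, p1, tau):
    """Equation (4): blend two corresponding image points linearly."""
    return ((1 - tau) * p0[0] + tau * p1[0],
            (1 - tau) * p0[1] + tau * p1[1])


def rotate_180(p):
    """Rotate a centred image point by 180 degrees about the line of sight."""
    return (-p[0], -p[1])


# Hypothetical control points in the first basis image (centred coordinates).
basis_a = [(1.0, 0.5), (-2.0, 3.0), (0.25, -4.0)]
# The second basis image sees the same features rotated by 180 degrees.
basis_b = [rotate_180(p) for p in basis_a]

# Halfway through the 'movie' every feature lands on the image centre.
midway = [interpolate(p, q, 0.5) for p, q in zip(basis_a, basis_b)]
print(midway)
```

Every entry of `midway` is the origin, whereas the parameter values 0 and 1 reproduce the two basis point-sets exactly.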

In fact, Seitz and Dyer [12] have shown that if $C'$ and $C''$ represent parallel
cameras¹ then direct interpolation will produce a valid intermediate view.
However, because it is difficult to ensure that the uncalibrated cameras are in this
configuration, it becomes necessary to rectify the basis images.

¹ If $C''$ can be obtained by displacing $C'$ along a direction orthogonal to its optic-axis,
then the two cameras are said to be parallel.

The disadvantage of this method is that it is both computationally complicated
and expensive. The epipoles must be estimated in order to compute the
rectifying homographies², and these transformations remain underdetermined
without the use of further constraints. Once these have been specified, $I'$ and
$I''$ must be rectified, and the interpolation performed. Finally, the novel view
has to be de-rectified. In order to generate a valid movie, the rectifying
transformations (as well as the images themselves) would have to be interpolated. A
detailed discussion of the rectification process can be found in 

Finally, we note that Pollard et al. have extended just such an interpolation
scheme to the three-camera case, and demonstrated good results without
performing the rectification stage. However, this is not viable when there is a
large difference in orientation between $C'$ and $C''$, as is the case in many
applications.



4 Parametric Synthesis 

In this section we will describe an alternative approach to the generation of $I(\tau)$.
It is based on the observation that the coefficients $a_n$ and $b_n$ in equation (2) are
functions of the camera parameters. It follows that if the camera-motion
can be parameterised as $C(\tau)$, then there must exist functions $a_n(\tau)$ and $b_n(\tau)$,
which link the virtual viewpoint to $I(\tau)$, via (2). For example, suppose that we
have a prior model of an orthographic camera, which is free to rotate through the
range 0 to $\phi$ in the horizontal-plane, while fixating a centrally positioned object.
If $\tau$ is used to specify the viewing angle, then we can define the parameterised
camera-matrix

\[
C(\tau) =
\begin{bmatrix}
\cos(\tau\phi) & 0 & \sin(\tau\phi) & 0 \\
0 & 1 & 0 & 0
\end{bmatrix}.
\tag{5}
\]

For images generated by such simple cameras, closed-form expressions can be
obtained for $a_n(\tau)$ and $b_n(\tau)$. However, real, uncalibrated images are unlikely
to be consistent with such a model. For this reason, we propose to derive a
parameterisation from the original image sequence.
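
As an illustration of the prior model in equation (5), the sketch below builds the parameterised camera matrix and projects a single scene point. It is a minimal sketch under stated assumptions: the 90° range, the sample point and the helper names `camera` and `project` are hypothetical choices, not taken from the experiments reported here.

```python
import math


def camera(tau, phi):
    """Equation (5): orthographic camera rotated by tau * phi about the vertical axis."""
    c, s = math.cos(tau * phi), math.sin(tau * phi)
    return [[c, 0.0, s, 0.0],
            [0.0, 1.0, 0.0, 0.0]]


def project(cam, point):
    """Apply a 2 x 4 affine camera to the scene point (x, y, z)."""
    homogeneous = list(point) + [1.0]
    return tuple(sum(m * v for m, v in zip(row, homogeneous)) for row in cam)


PHI = math.pi / 2           # assumed range of the rotation (90 degrees)
point = (2.0, 3.0, 5.0)     # hypothetical scene point

first = project(camera(0.0, PHI), point)   # first basis view
last = project(camera(1.0, PHI), point)    # second basis view
mid = project(camera(0.3, PHI), point)     # an intermediate view
print(first, last, mid)
```

With this particular range the basis views see x' = x and x'' = z, so the intermediate x-coordinate equals cos(τφ) x' + sin(τφ) x'', which is one simple instance of a closed-form dependence of the coefficients on the viewpoint parameter.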

Because the constraints given in (3) can be applied to the view-synthesis
equation (2), the unknown functions $a_n(\tau)$ and $b_n(\tau)$ must satisfy the following
requirements:

\[
\begin{aligned}
a_1(0) &= 1, & a_3(1) &= 1, \\
a_n(0) &= 0,\ n \neq 1, & a_n(1) &= 0,\ n \neq 3, \\
b_2(0) &= 1, & b_4(1) &= 1, \\
b_n(0) &= 0,\ n \neq 2, & b_n(1) &= 0,\ n \neq 4.
\end{aligned}
\tag{6}
\]

² In principle, affine imaging produces parallel epipolar lines within each image.
However, this cannot be assumed in the treatment of real images.

These conditions allow us to propose functional forms for the variation of the
coefficients. Where possible, we begin with linear models:

\[ a_1(\tau) = b_2(\tau) = 1 - \tau, \tag{7} \]

\[ a_3(\tau) = b_4(\tau) = \tau. \tag{8} \]

The remaining coefficients are tied to zero at both extremes of $\tau$. A simple
hypothesis generates the equations

\[ a_n(\tau) = \alpha_n \tau (1 - \tau), \quad n \neq 1, 3, \tag{9} \]

\[ b_n(\tau) = \kappa_n \tau (1 - \tau), \quad n \neq 2, 4, \tag{10} \]

where $\alpha_n$ and $\kappa_n$ are (possibly negative) scalars.
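
The linear and quadratic hypotheses (7) to (10) can be sketched in a few lines, together with the linear combination of equation (2) that they feed. This is a minimal sketch, assuming the scalars of (9) and (10) are supplied as five-element lists indexed by n; the numerical values and the names `coefficients` and `synthesise` are made up for illustration.

```python
def coefficients(tau, alpha, kappa):
    """Return (a, b), each holding the five coefficients n = 0..4 at parameter tau."""
    bump = tau * (1 - tau)                 # vanishes at both basis views
    a = [alpha[n] * bump for n in range(5)]   # equation (9)
    b = [kappa[n] * bump for n in range(5)]   # equation (10)
    a[1], a[3] = 1 - tau, tau              # equations (7) and (8), a coefficients
    b[2], b[4] = 1 - tau, tau              # equations (7) and (8), b coefficients
    return a, b


def synthesise(p1, p2, a, b):
    """Equation (2): position of a feature given its basis positions p1 and p2."""
    basis = [1.0, p1[0], p1[1], p2[0], p2[1]]
    x = sum(c * v for c, v in zip(a, basis))
    y = sum(c * v for c, v in zip(b, basis))
    return x, y


# Hypothetical scalars; entries overridden by (7) and (8) are ignored.
alpha = [0.3, 0.0, -0.2, 0.0, 0.1]
kappa = [-0.1, 0.05, 0.0, 0.2, 0.0]
p1, p2 = (2.0, 3.0), (5.0, -1.0)   # one feature in the two basis images

start = synthesise(p1, p2, *coefficients(0.0, alpha, kappa))
end = synthesise(p1, p2, *coefficients(1.0, alpha, kappa))
halfway = synthesise(p1, p2, *coefficients(0.5, alpha, kappa))
print(start, end, halfway)
```

Because the quadratic term vanishes at both extremes, the synthesised feature coincides with its basis positions at parameter values 0 and 1, whatever scalars are chosen.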



4.1 Estimating the Viewpoint Parameter 



We now have prototypes for the variation of the coefficients, as functions of the
viewpoint parameter, $\tau$. The constraints imposed above ensure that the
basis-views are correctly represented, but we also have $T - 2$ intermediate samples
from each of these functions, in the form of the $a$ and $b$ coefficients expressing
each ‘keyframe’, $I_t$, in terms of $I'$ and $I''$. However, we do not know the actual
value of $\tau$ corresponding to each image $I_t$; we will refer to this unknown as $\tau_t$,
with $t = 1, \ldots, T$, as before. In practice, we can estimate $\tau_t$ by assuming that
our simple hypotheses (7–10) are valid, in a least-squares sense, for $a$ or $b$, or for
both. For example, we can posit that $a_3(\tau) = \tau$, and that $a_1(\tau) = 1 - \tau$. From
this overcomplete specification we can obtain an estimate, $\hat{\tau}_t$, of $\tau_t$ according to

\[
\hat{\tau}_t = \arg\min_{\tau} \left[ (a_3 - \tau)^2 + (a_1 + \tau - 1)^2 \right],
\]

where the $a$ coefficients (estimated via the pseudoinverse, as indicated in §2) are
particular to each target image $I_t$. The solution of the least-squares problem is,
of course,

\[
\hat{\tau}_t = \tfrac{1}{2} (1 - a_1 + a_3),
\tag{11}
\]

subject to the validity of equations (7) and (8). The use of the $a$ coefficients in the
above procedure is governed by the expectation that they will show significant
variation during a horizontally oriented motion of the camera. If the path were
closer to a vertical plane, then it would be preferable to use the $b$ coefficients
instead. In general, we could use a least-squares estimate (if necessary, weighted)
over all four coefficients in (7, 8), where the latter would be estimated via the
pseudoinverse, as before.
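
The closed-form estimate (11) follows from setting the derivative of the squared error to zero, and it can be checked directly. The sketch below is a minimal illustration; the coefficient values 0.7 and 0.2 are invented, and `cost` and `estimate_tau` are hypothetical helper names.

```python
def cost(tau, a1, a3):
    """Squared error of the hypotheses a3 = tau and a1 = 1 - tau."""
    return (a3 - tau) ** 2 + (a1 + tau - 1) ** 2


def estimate_tau(a1, a3):
    """Equation (11): least-squares estimate of the viewpoint parameter."""
    return 0.5 * (1.0 - a1 + a3)


# Made-up coefficients for one keyframe, as if estimated via the pseudoinverse.
a1, a3 = 0.7, 0.2
tau_hat = estimate_tau(a1, a3)
print(tau_hat, cost(tau_hat, a1, a3))
```

Perturbing `tau_hat` in either direction increases `cost`, and the basis views themselves (a1 = 1, a3 = 0 and a1 = 0, a3 = 1) map to the parameter values 0 and 1.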



4.2 Modelling the Coefficients 

Should the hypothesised linear and quadratic equations (7–10) prove poor
approximations to the behaviour of the coefficients, we can add further terms,
obeying (6), as follows. Consider the cubic models

ai(r) = 1 - r + e(l - r)^, a 3 (r) = r + Cr^, ( 12 ) 

62 (t) = 1 — r + 1^(1 — t)^, 54 (r) = r + ^T^ (13) 

and 

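\[
a_n(\tau) = \alpha_n \tau (1 - \tau) + \beta_n \tau^2 (1 - \tau) + \gamma_n \tau (1 - \tau)^2,
\quad n \neq 1, 3,
\tag{14}
\]

\[
b_n(\tau) = \kappa_n \tau (1 - \tau) + \lambda_n \tau^2 (1 - \tau) + \mu_n \tau (1 - \tau)^2,
\quad n \neq 2, 4.
\tag{15}
\]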



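We summarise the fitting procedure as follows: For each target $I_t$, we use
equation (2) to compute the coefficients of the linear combination, $a_n$ and $b_n$. Next
we use equation (11) to estimate the value of $\tau_t$ corresponding to each $I_t$, where
the latter are viewed as ‘keyframes’ in the movie, $I(\tau)$. This enables us to use
our $T - 2$ samples $\{a_n, b_n\}$ to estimate the functions $a_n(\tau)$ and $b_n(\tau)$.

For example, from (14), we have the following cubic hypothesis for the
variation of coefficient $a_{n \neq 1,3}$ over the $T$ target images:

\[
\begin{bmatrix}
a_n(\hat{\tau}_1) \\ a_n(\hat{\tau}_2) \\ \vdots \\ a_n(\hat{\tau}_T)
\end{bmatrix}
=
\begin{bmatrix}
\hat{\tau}_1 (1 - \hat{\tau}_1) & \hat{\tau}_1^2 (1 - \hat{\tau}_1) & \hat{\tau}_1 (1 - \hat{\tau}_1)^2 \\
\hat{\tau}_2 (1 - \hat{\tau}_2) & \hat{\tau}_2^2 (1 - \hat{\tau}_2) & \hat{\tau}_2 (1 - \hat{\tau}_2)^2 \\
\vdots & \vdots & \vdots \\
\hat{\tau}_T (1 - \hat{\tau}_T) & \hat{\tau}_T^2 (1 - \hat{\tau}_T) & \hat{\tau}_T (1 - \hat{\tau}_T)^2
\end{bmatrix}
\begin{bmatrix}
\alpha_n \\ \beta_n \\ \gamma_n
\end{bmatrix}.
\tag{16}
\]

Such an equation can be solved by standard least-squares techniques, such as
Cholesky decomposition. Note that the system is exactly determined when $T = 5$,
resulting in three ‘keyframe’ equations per $a_n$, and likewise three per $b_n$.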
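
A least-squares fit of the kind expressed by equation (16) can be sketched without any libraries. One detail the sketch has to work around: since τ(1−τ) = τ²(1−τ) + τ(1−τ)², the three columns of the matrix in (16) are linearly dependent, so normal equations in all three unknowns would be singular. The code below therefore fits the equivalent two-term form, in which the first scalar is absorbed into the other two; a minimum-norm (pseudoinverse) solution of the three-term system would describe the same curve. The keyframe parameters and sample values are invented, and `fit_cubic` and `model` are hypothetical names.

```python
def model(tau, beta, gamma):
    """Cubic that vanishes at tau = 0 and tau = 1."""
    return beta * tau ** 2 * (1 - tau) + gamma * tau * (1 - tau) ** 2


def fit_cubic(taus, samples):
    """Least-squares estimate of (beta, gamma) via the 2 x 2 normal equations."""
    rows = [(t * t * (1 - t), t * (1 - t) ** 2) for t in taus]
    s11 = sum(p * p for p, _ in rows)
    s12 = sum(p * q for p, q in rows)
    s22 = sum(q * q for _, q in rows)
    r1 = sum(p * s for (p, _), s in zip(rows, samples))
    r2 = sum(q * s for (_, q), s in zip(rows, samples))
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("need at least two distinct interior keyframes")
    # Cramer's rule for the symmetric 2 x 2 system.
    beta = (r1 * s22 - s12 * r2) / det
    gamma = (s11 * r2 - s12 * r1) / det
    return beta, gamma


# Synthetic keyframes: T = 5 parameter estimates and noiseless coefficient samples.
taus = [0.0, 0.25, 0.5, 0.75, 1.0]
samples = [model(t, 0.4, -0.3) for t in taus]
beta, gamma = fit_cubic(taus, samples)
print(beta, gamma)
```

In terms of the three-term model, the fitted pair corresponds to the sums of the first scalar with each of the other two, which is all that the data can determine.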



5 Rendering 

Section 4 described a method for estimating the positions of image features in
novel views. A significant advantage of the linear-combinations approach is that
good results can be obtained from a small number of control-points (we typically
used around 30-70 points per image).

However, when we come to render the novel view, we must obtain a value for
every pixel, not just for the control-points. One way to fill in $I(\tau)$ is to compute
the regularised optic-flow from the basis images, although this incurs a large
computational cost. Furthermore, the estimation of optic-flow is error-prone, and
any outlying pixels will noticeably disrupt the new image. We therefore prefer
to use a simple multiview-warping scheme, as described below.

5.1 Image Triangulation and Warping 

In order to render the novel view, we use the estimated positions of the
control-points to determine a mapping from each of the basis images. This can be
achieved by performing a systematic triangulation of the control points, as they
appear in the current frame of $I(\tau)$. The triangulation is then transferred to the
control-points in each basis view, thus ensuring the consistency of the mapping
over all images.

We employed a constrained Delaunay³ routine, which allows us to
guarantee that different image-regions will be represented by different triangles.
Furthermore, the constraints can be used to effectively separate the object from
the background, by imposing an arbitrary boundary on the triangulation.

Once the triangulation has been transferred to $I'$ and $I''$, we can use it to
warp each basis-image into registration with the novel view $I(\tau)$. In fact, this
is a special case of the texture-mapping problem, and was implemented as such,
using the OpenGL graphics library. Because each mapping is piecewise-affine,
the intra-triangle interpolation is linear in screen-space, which simplifies
the procedure. Finally, bilinear interpolation was used to resolve the values of
non-integer coordinates in $I'$ and $I''$.



5.2 Computation of Intensities 

In the previous section, we showed how the basis views can be warped into
registration, such that corresponding image regions are aligned. The novel view
can now be rendered as a weighted sum of the (warped) basis images

\[ I(\tau) = w' I' + w'' I''. \tag{17} \]

As will be described below, the weights $w'$ and $w''$ are also functions of $\tau$,
although this property will not be made explicit in the notation.

If the present frame of $I(\tau)$ happens to coincide with either $I'$ or $I''$, then the
other basis image should not contribute to the rendering process. This imposes
the following requirements on $w'$ and $w''$:

\[
I(\tau) = I' \;\Rightarrow\; \begin{cases} w' = 1 \\ w'' = 0 \end{cases}
\qquad
I(\tau) = I'' \;\Rightarrow\; \begin{cases} w' = 0 \\ w'' = 1 \end{cases}
\tag{18}
\]

In fact, it makes physical sense to impose the additional constraint $w' + w'' = 1$,
such that equation (17) becomes a barycentric combination. Using the results of
§4 it would be possible to define the weights as $w' = 1 - \tau$ and $w'' = \tau$. However,
this would be unsatisfactory, because $\tau$ specifies the notional 3D viewpoint,
which may be a poor measure of the 2D image relationships. For this reason, we
follow [12], and use equation (2) to derive appropriate weights. Specifically, we
define distances $d'$ and $d''$ of $I(\tau)$ from $I'$ and $I''$ respectively:

\[ d'^2 = a_3(\tau)^2 + a_4(\tau)^2 + b_3(\tau)^2 + b_4(\tau)^2, \tag{19} \]

\[ d''^2 = a_1(\tau)^2 + a_2(\tau)^2 + b_1(\tau)^2 + b_2(\tau)^2. \tag{20} \]

³ Once the triangulation has been transferred to another point-set, only the adjacency
properties are necessarily preserved.

We then compute the weights according to

\[
w' = \frac{d''^2}{d'^2 + d''^2},
\qquad
w'' = \frac{d'^2}{d'^2 + d''^2}.
\tag{21}
\]

Having satisfied conditions (18) we can now compute the novel view, using
equation (17). This potentially slow operation was implemented via the OpenGL
accumulation buffer, which operates in hardware (where available). Clearly the
method generalises immediately to colour images, by treating each spectral band
as a luminance component.
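
The weighting scheme of equations (19) to (21) can be sketched as follows. This is a minimal sketch, assuming the coefficients are held in five-element lists indexed by n (with the constant terms at index 0, which the distances ignore); the function names and the grey-level values are hypothetical.

```python
def blend_weights(a, b):
    """Equations (19)-(21): weights of the two basis images from the coefficients."""
    d1_sq = a[3] ** 2 + a[4] ** 2 + b[3] ** 2 + b[4] ** 2   # distance from first basis image
    d2_sq = a[1] ** 2 + a[2] ** 2 + b[1] ** 2 + b[2] ** 2   # distance from second basis image
    total = d1_sq + d2_sq
    return d2_sq / total, d1_sq / total


def blend(value1, value2, w1, w2):
    """Equation (17) applied to one pair of aligned (warped) pixel values."""
    return w1 * value1 + w2 * value2


# Coefficients at the first basis view, the second basis view, and halfway.
at_first = blend_weights([0, 1, 0, 0, 0], [0, 0, 1, 0, 0])
at_second = blend_weights([0, 0, 0, 1, 0], [0, 0, 0, 0, 1])
halfway = blend_weights([0, 0.5, 0, 0.5, 0], [0, 0, 0.5, 0, 0.5])
print(at_first, at_second, halfway, blend(100.0, 200.0, *halfway))
```

The weights sum to one by construction, and at either basis view the other image receives zero weight, as required by conditions (18).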



5.3 Hidden-Surface Removal 

An issue which must be addressed by all view-synthesis schemes is the treatment
of missing or inconsistent data during the rendering process. Clearly, it is only
possible to portray those features in $I(\tau)$ which were present in at least one
basis-image. If a new feature does appear, then the method described above
renders a mixture of the (spatially) corresponding regions in the basis-images,
thereby ensuring that, though it may be incorrect, each frame of $I(\tau)$ is at least
continuous.

The problem of consistency is more tractable, as it is possible to remove any
parts of the basis-images which are occluded from the novel viewpoint. Moreover,
although occlusion events are generated by the 3D nature of the scene, it does
not follow that we have to explicitly recover the 3D structure in order to resolve
them. Rather, it is sufficient to obtain the affine depth of each control-point.
Geometrically, this quantity can be understood as the relative deviation from a
notional world plane, where the latter is defined by three distinguished points in
the scene. It is straightforward to compute the affine depth, using the positions of
the control points in two basis-images, together with the constraint that the three
distinguished points have affine depths equal to zero. Once the measurements
have been made at the control-points, we use the standard z-buffer algorithm to
interpolate the depth over each triangle, and to remove the occluded regions.

A further consequence of this process is that the combination of views (2)
can be recomputed at each of the $T - 2$ keyframes, excluding those
control-points which occupy hidden surfaces in $I_t$. This should lead to a more accurate
synthesis, because the occluded points are likely to be outliers with respect to
the original estimate.

6 Results 

The methods which have been described were tested by simulation, using 3D
models defined in the OpenGL environment. Several scenes (comprising
arrangements of cubes) were imaged, while the viewpoint was subject to systematic
variation. Both perspective and affine projections were recorded, and the matrix
pseudoinverse was used to solve equation (2). As well as exact control of the
camera model, the simulation procedure provides noiseless image coordinates for
each control-point, and allows us to monitor occlusion effects.

The graphs in figures 1 and 2 show how the coefficients of the linear
combination evolve as the (perspective) camera rotates through 90° in a horizontal
plane. In the case of real images, it is likely that these curves would be
disrupted by occlusion effects. As described in §5.3, it would be possible to avoid this,
although we have not yet implemented the necessary re-estimation procedure.

[Figure 1 plot: four panels, (1) to (4); vertical axes a (top row) and b (bottom row),
each from 0.0 to 1.0; horizontal axis t, from 1 to 21.]

Fig. 1. Variation of the linear-combination coefficients (top: $a_1, a_2, a_3, a_4$; bottom:
$b_1, b_2, b_3, b_4$). The simulated camera was rotated through 90° around the vertical axis
of a group of six cubes. $T = 21$ pictures were taken in total. The functions obey the
conditions (6), as expected, though $a_{2,4}$ and $b_{1,3}$ (which were zero) have been shifted
up by 0.5 for the purpose of this display. The nonzero variation of the $b_{2,4}$ coefficients
is attributable to perspective effects.

When plotted on the same axes, the fitted models (12–15) are
indistinguishable from the measured curves shown in figures 1 and 2. For reasons of space,
we proceed directly to the final error-measures, produced by using the fitted
models to position the control-points in the existing $T$ images. The two graphs
in figure 3 show the r.m.s. discrepancy (in $x$ and $y$ respectively) between the
estimated and actual control-points. For $t = 1$ and $t = 21$ the error is zero,
because $I' = I_1$ and $I'' = I_{21}$ respectively.

We have not yet implemented the estimation of $\tau_t$ (as described in §4.1)
from a real image sequence. However, preliminary results have been obtained
by applying the functions $a_n(\tau)$ and $b_n(\tau)$, obtained by simulation, to real pairs
of basis-images. For example, we produced the novel views shown in figure 4
by applying the functions shown in figures 1 and 2 to the two framed images.
Once the polynomial models $a_n(\tau)$ and $b_n(\tau)$ have been obtained, they can be
resampled to yield an arbitrary number of intermediate images.

In the light of our comments in §2 we note that the example shown in
figure 4 is rather straightforward, in that the camera-motion is extremely simple.



200 



M.E. Hansard and B.F. Buxton 



[Figure 2 plot: panel (0); horizontal axis t, from 1 to 21.]

Fig. 2. Variation of coefficients $a_0$ (top) and $b_0$ (bottom). These are the
constant terms in the linear combination (2), which are not of comparable
magnitude to the coefficients plotted in fig. 1.

[Figure 3 plot: r.m.s. error against t, from 1 to 21.]

Fig. 3. Mean error of the parametric-synthesis, measured in $x$ (top) and $y$
(bottom). The units are pixels, where the original images were of size 256 × 256.



Nonetheless, rotation around the object is a typical requirement in visualisation 
applications. 

As far as the image-quality is concerned, the results seem promising; in parti- 
cular, we note that the left and right edges of the mask are appropriately culled, 
according to the viewpoint. The constrained triangulation also seems to perform 
adequately, although there is some distortion of the top-right edge of the mask 
during the early frames. This may indicate that insufficient control-points were 
placed around the perimeter. 

7 Conclusions and Future Work 

We have described a simple method of parametric view-synthesis, developed with
the requirements of Web-based visualisation in mind. Our results are also
relevant to other applications of the linear combinations of views theory, including
animation and low-bandwidth video-transmission.

In this outline we have concentrated on simple motions of the camera, because
these are typical of visualisation applications. In principle, the same approach
could be applied to general conjunctions of rotation and translation. However,
it is to be expected that arbitrary motions will add complexity to the functions
$a_n(\tau)$ and $b_n(\tau)$, which may demand the addition of further polynomial terms
to the model (12–15). Arbitrary camera trajectories may also require the speed
of the parameterisation to be regulated.

In future, it may be possible to extend the present method to cover a region
of the view-sphere, rather than just a particular path of the camera; this would
require two parameters, $\{\theta, \phi\}$, in place of $\tau$. Because of the increased variety
of intermediate images, such an extension would also require the use of more
than two basis-views. This leads to the general question of how the quality of
the results is related to the number of basis images employed, and whether it is
possible to select appropriate basis-views automatically.

If these issues can be resolved, it may be possible to extend the methods
described above to the perspective case, via the trifocal tensor.

Fig. 4. A simple example of parametric view-synthesis, using 74 control-points. The
basis-images are shown framed; the other thirteen views were generated as described
in the text. The viewpoint is ‘extrapolated’ slightly in the final frame.
Abstract. This paper presents a method of matching ambiguous feature
sets extracted from images. The method is based on Wilson and Hancock’s
Bayesian matching framework, which is extended to handle
the case where the feature measurements are ambiguous. A multimodal
evolutionary optimisation framework is proposed, which is capable
of simultaneously producing several good alternative solutions. Unlike
other multimodal genetic algorithms, the one reported here requires no
extra parameters: solution yields are maximised by removing bias in the
selection step, while optimisation performance is maintained by a local
search step. An experimental study demonstrates the effectiveness of the
new approach on synthetic and real data. The framework is in principle
applicable to any multimodal optimisation problem where local search
performs well.



1 Introduction 

Graph matching problems have pervaded computer vision since the early 1970s,
when Barrow and Popplestone used graphs in [2] to represent spatial
relationships between scene components. Minsky’s “frames” extended this idea using a
hierarchical representation with image features at the bottom level and scene
knowledge [5]. These ideas have given rise to representations such as aspect
graphs and part-primitive hierarchies, which have found favour. Comparing relational
representations is a central task in many areas of computer vision, among which
are stereo matching, image registration, object recognition and image
labelling. The main problem is that the two graphs are often different, so that
it is not possible to locate an isomorphism. The inexact graph matching problem
is to locate the best match between the graphs. Early attempts at inexact
matching used heuristics to reconcile dissimilar structures. These heuristics have
been augmented by information theoretic and probabilistic criteria.
Over the last ten years, there has been more interest in using statistical
evidence from the scene, instead of purely structural matching. Wilson and
Hancock have recently developed a Bayesian formulation which naturally
combines the structural constraints imposed by the graphs with evidence from the
scene. This formulation provides a distance measure which can be optimised
with gradient ascent, relaxation or genetic algorithms [14,15,16]. However, the
classification technique used by Wilson and Hancock makes the assumption that
there is a single best match. This approach works best where each of the features
is distinctly located in the feature space, when it is possible to make a single
minimum risk assignment. However, when the features are poor in the sense
that there is considerable overlap in the feature space, restricting attention to
the most probable assignments incurs a substantial risk of ignoring alternatives
which are only slightly less good. For example, in a situation where there are two
possible assignments with probabilities close to 0.5, it would be unwise to ignore
the less likely one. This paper will demonstrate how to overcome this difficulty
for graph matching in the context of Wilson and Hancock’s Bayesian framework.

The standard method of enumerating a set of good solutions to an
optimisation problem is to restart the optimiser, possibly with modifications to avoid
revisiting optima. However, this implies discarding valuable information
uncovered in the optimisation process. An alternative is to use population based
optimisation techniques, of which the genetic algorithm is the most
well-known.

The idea that genetic algorithms can be used to simultaneously find more
than one solution to a problem was first mooted by Goldberg and Richardson.
They attempted to prevent the formation of large clusters of identical
individuals in the population by de-rating the fitness function. Other techniques
include crowding, sequential niching, and distributed genetic algorithms.
A common feature of these approaches has been the necessity for extra
parameters. Niching and crowding strategies typically require two or three extra
parameters to be controlled. These parameters are needed, for example, to
determine when to de-rate the fitness of an individual, by how much, and the distance
scale of the de-rating function. In distributed algorithms, it is necessary to
decide how to arrange the sub-populations, their sizes, and under what conditions
migration between them may occur. In [28], Smith and co-workers demonstrated
a situation in which niching could occur in a standard genetic algorithm, without
the need for any extra parameters. The authors have demonstrated elsewhere
that unmodified genetic algorithms are capable of finding many solutions to line
labelling problems. This paper will obtain similar behaviour for graph
matching, and show how suitable algorithm modifications can improve solution yield
without introducing any new parameters.

This paper will show that using a genetic algorithm as an optimisation
framework for graph matching could allow a vision system to follow Marr’s
principle of least commitment. The outline of this paper is as follows. The next
section reviews and extends Wilson and Hancock’s Bayesian graph matching
criterion. Section 3 considers the implementation of a genetic algorithm for
ambiguous graph matching problems. An experimental study is presented in section
4. Finally, section 5 draws some conclusions and suggests directions for future
research.
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2 Bayesian Matching Criterion 



The problems considered in this paper involve matching attributed relational
graphs. An attributed relational graph is a triple, $G = (V, E, A)$, where $V$
is the set of vertices or nodes, $E \subseteq V \times V$ is the set of edges, and
$A \subset V \times \mathbb{R}^k$ is the set of measurement $k$-vectors relating to the
original scene. Graph matching is the problem of establishing a correspondence
between a data graph, $G_D = (V_D, E_D, A_D)$, and a model graph,
$G_M = (V_M, E_M, A_M)$. This correspondence,
$f : V_D \to V_M \cup \{\phi\}$, is a labelling of the nodes in $V_D$ with nodes from $V_M$ or a
special null label, $\phi$, for unmatchable nodes.

Wilson and Hancock described a framework in which both neighbourhood
structure and node attributes were combined in a single measure of
matching consistency. The goal is to optimise the a posteriori probability of the
match given the measurements:

$$P(f \mid A_D, A_M) = \frac{p(A_D, A_M \mid f)\, P(f)}{p(A_D, A_M)} \qquad (1)$$



where $P(f)$ is the structural component of the criterion. The joint
measurement density, $p(A_D, A_M)$, only depends on the measurements and is thus a
static property of the problem which can be ignored when comparing matches.
The conditional measurement density, $p(A_D, A_M \mid f)$, depends on both the
current match and the measurements. Wilson and Hancock showed that, assuming
conditional independence of these measurements given the current match, the
conditional measurement density can be factorised over the tuples in $f$ to give



$$p(A_D, A_M \mid f) = \prod_{(u,v) \in f} \frac{P(u, v \mid x_u, x_v)\, p(x_u, x_v)}{P(u, v)} \qquad (2)$$



where the posterior matching probability, $P(u, v \mid x_u, x_v)$, is the probability
of node $u$ from the data graph matching node $v$ in the model graph given their
measurements, $x_u$ and $x_v$. Like $p(A_D, A_M)$, the unconditional density, $p(x_u, x_v)$,
is independent of the current match, $f$. Assuming that the matching priors,
$P(u, v)$, are uniformly distributed, equation 1 can be written



$$P(f \mid A_D, A_M) \propto \Bigl[ \prod_{(u,v) \in f} P(u, v \mid x_u, x_v) \Bigr] P(f) \qquad (3)$$



Wilson and Hancock used this relationship to formulate a MAP update
rule for iteratively improving the match, $f$. The mapping of data
graph node $u$ was chosen from the union of the model graph node set, $V_M$, with
the null label, $\{\phi\}$, according to:

$$f(u) = \arg\max_{v \in V_M \cup \{\phi\}} P(u, v \mid x_u, x_v)\, P(f) \qquad (4)$$



The structural component of the criterion, $P(f)$, although an essential
ingredient, is not a primary concern in this paper. To summarise, the structural
constraints on matching can be explicitly captured by defining a dictionary of
legal assignments, $\Theta_j$, over the neighbourhood of each node, $j$, in the data graph.
Wilson and Hancock then applied Hancock and Kittler’s Bayesian formulation
of dictionary-based relaxation, in which a memoryless error process is
assumed to have produced the current assignments of the $j$th neighbourhood, $\Gamma_j$,
by corrupting the dictionary item, $S_i$. This error model leads to a structural
consistency criterion which is exponential in the distance between assignments
and dictionary items, $D$:

$$P(f) = \frac{1}{|V_D|} \sum_{j \in V_D} \frac{K_{\Gamma_j}}{|\Theta_j|} \sum_{S_i \in \Theta_j} \exp\bigl[-k_e D(\Gamma_j, S_i)\bigr] \qquad (5)$$

where $K_{\Gamma_j} = (1 - P_e)^{|\Gamma_j|}$ and $k_e = \ln\frac{1 - P_e}{P_e}$. The probability of assignment
corruption, $P_e$, can be used as a control variable in a manner analogous to the
temperature in simulated annealing. Myers, Wilson and Hancock have
shown that the Levenshtein distance is the most appropriate choice for $D$.

Wilson and Hancock also used a feasibility heuristic to screen out highly 
unlikely mappings. If the difference in the neighbourhood sizes of a data graph 
node, u, and a model graph node, v, is above some threshold, the mapping, 
$f(u) = v$, is considered infeasible. This heuristic has been very successful in
reducing the size of the search space. 



2.1 Measurement Ambiguity 

The measurement information contributes to the matching criterion via the
posterior matching probability, $P(u, v \mid x_u, x_v)$, which has yet to be defined.
Wilson and Hancock defined it in terms of the Euclidean distance between
attribute pairs for non-null mappings:



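$$P(u, v \mid x_u, x_v) = \begin{cases} P_\phi & \text{if } v = \phi \\[4pt] (1 - P_\phi)\, \dfrac{\exp\left[-\frac{(x_u - x_v)^2}{2\sigma_v^2}\right]}{\sum_{w \in V_M} \exp\left[-\frac{(x_u - x_w)^2}{2\sigma_w^2}\right]} & \text{otherwise} \end{cases} \qquad (6)$$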



where $P_\phi$ is the prior probability of a null match, $f(u) = \phi$, which may be
taken as $2\,\bigl|\,|V_D| - |V_M|\,\bigr| / (|V_D| + |V_M|)$, and $\sigma_v^2$ is the estimated
variance of $x_v$. This effectively
regards the model graph node measurement, $x_v$, as a mean about which the data
graph node measurement, $x_u$, varies with estimated variance $\sigma_v^2$, under the null
hypothesis that the two measurements are the same (because the nodes match).
This approach requires the assumption that a data measurement is only likely
to be statistically close to one of the model measurements. This is ideal when
there is little overlap between classes, e.g. for possible angles of line-fragments
segmented from a radar image. However, if there is significant overlap, e.g. in the
average intensities of regions, such a scheme will not reflect these ambiguities in
its classification of features.

The alternative is to compare the data measurements to the model
measurements using an artificial scale. This can be done by considering the number of
standard deviations separating the data measurement from its class mean under
the null hypothesis that the nodes match. Table 1 gives an example of such a
scale for the arbitrary classes “similar”, “comparable”, and “different”.



Table 1. Example Scale for Measurement Comparisons.

Class        Range of standard deviations from the class mean
Similar      [0, 1.0]
Comparable   (1.0, 2.0]
Different    (2.0, ∞)



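Consider the standardised distance, $\Delta_{uv} = \|x_u - x_v\| / \sigma_v$. The probability
that $x_u$ lies within $[a, b]$ standard deviations on either side of $x_v$ is twice the
standard Normal integral from $a$ to $b$:

$$P(a \le \Delta_{uv} \le b) = \frac{2}{\sqrt{2\pi}} \int_a^b e^{-z^2/2}\, dz = \operatorname{erf}\!\left(\frac{b}{\sqrt{2}}\right) - \operatorname{erf}\!\left(\frac{a}{\sqrt{2}}\right) \qquad (7)$$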
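As an illustrative sketch only, the interval probability of equation 7 can be evaluated with the error function from the Python standard library. The function name `prob_between` and the `classes` dictionary are hypothetical; the intervals are the example classes of table 1.

```python
import math

def prob_between(a: float, b: float) -> float:
    """P(a <= Delta_uv <= b) under the null hypothesis, following equation 7."""
    root2 = math.sqrt(2.0)
    return math.erf(b / root2) - math.erf(a / root2)

# Intervals (in standard deviations) for the example classes of table 1.
classes = {
    "similar": (0.0, 1.0),
    "comparable": (1.0, 2.0),
    "different": (2.0, math.inf),
}

probabilities = {name: prob_between(lo, hi) for name, (lo, hi) in classes.items()}
for name, p in probabilities.items():
    print(f"{name:<10} {p:.4f}")
```

Because the three intervals partition the non-negative half line, the three probabilities sum to one.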



Each of the classes in table 1 corresponds to a separate interval which must
be considered. Rather than introduce so many extra parameters, it is better to
simplify the classification to “similar” if $\Delta_{uv} \in [0, a]$ and “dissimilar” otherwise.
Thus, $P(u, v \mid x_u, x_v)$ can be defined as follows:



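$$P(u, v \mid x_u, x_v) = \begin{cases} P_\phi & \text{if } v = \phi \\[4pt] (1 - P_\phi)\, \dfrac{\exp\left[-\frac{(x_u - x_v)^2}{2\sigma_v^2}\right]}{\sum_{w \in V_M} \exp\left[-\frac{(x_u - x_w)^2}{2\sigma_w^2}\right]} & \text{if } a = 0 \\[4pt] (1 - P_\phi)\, P[\Delta_{uv} \le a] & \text{if } \Delta_{uv} \le a \\[4pt] (1 - P_\phi)\,(1 - P[\Delta_{uv} \le a]) & \text{otherwise} \end{cases} \qquad (8)$$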



For convenience, the original unambiguous definition is used when $a = 0$. At
the cost of an extra parameter, $a$, ambiguous measurements can now be handled.
The important property of equation 8 is that when $a > 0$, it assigns exactly
the same probability to sets of mappings, thus enabling different alternatives to be
considered.
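The following minimal sketch illustrates that property for non-null mappings with $a > 0$. It assumes the two-class reading described in the text: a mapping is “similar” when the standardised distance is at most $a$ and “dissimilar” otherwise, with the class probability taken from the error function. The function names and the numerical values are made up for illustration.

```python
import math

def similar_mass(a: float) -> float:
    """P[Delta_uv <= a]: the interval probability for [0, a]."""
    return math.erf(a / math.sqrt(2.0))

def ambiguous_match_probability(x_u, x_v, sigma_v, a, p_null):
    """Posterior for a non-null mapping when a > 0 (similar or dissimilar)."""
    if a <= 0:
        raise ValueError("this sketch only covers a > 0")
    delta = abs(x_u - x_v) / sigma_v          # standardised distance
    if delta <= a:
        return (1.0 - p_null) * similar_mass(a)
    return (1.0 - p_null) * (1.0 - similar_mass(a))

# One data measurement compared with three model measurements (made-up values).
x_u, sigma, a, p_null = 100.0, 5.0, 1.0, 0.1
scores = [ambiguous_match_probability(x_u, x_v, sigma, a, p_null)
          for x_v in (97.0, 104.0, 130.0)]
print(scores)
```

The first two model measurements lie within one standard deviation of the data measurement and therefore receive identical probabilities, so neither mapping is preferred on measurement evidence alone.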

The ambiguity parameter, $a$, has a direct interpretation as the number of
standard units beyond which data measurements are considered dissimilar from
the model. However, it has a more useful indirect interpretation in terms of
“matching tolerance”. The matching tolerance, $T$, is defined as the average proportion
of model measurements similar to each data measurement, and is a function of
$a$ for any particular graph pair, as shown in equation 9. The feasibility of
mappings, feasible$(u, v)$, is determined according to the neighbourhood size difference
constraint mentioned at the end of section 2.



$$T(a) = \frac{1}{|V_D|\,|V_M|} \Bigl| \bigl\{ (u, v) \in V_D \times V_M \mid \text{feasible}(u, v) \wedge \Delta_{uv} \le a \bigr\} \Bigr| \qquad (9)$$

Figure 1 shows $T$ as a function of $a$ for several synthetic graphs. The tolerance
reaches a plateau of between 0.4 and 0.7 at values of $a$ higher than about 2. This
plateau is the limit imposed by the feasibility constraint, feasible$(u, v)$. These
graphs suggest that it should be possible to determine an appropriate value of $a$
by estimating the proportion of “similar” features in the data set. For example,
if this proportion is estimated to be 10%, $a$ should be about 0.5.
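A small sketch of how the matching tolerance could be computed is given below. The node representation (tuples of measurement, standard deviation and neighbourhood size), the threshold name `max_diff` and the toy values are all assumptions made for illustration; scalar measurements are assumed.

```python
def feasible(size_u: int, size_v: int, max_diff: int) -> bool:
    """Neighbourhood size difference constraint: infeasible above the threshold."""
    return abs(size_u - size_v) <= max_diff

def matching_tolerance(data, model, a, max_diff):
    """Fraction of (data, model) pairs that are feasible and within a std devs.

    data:  list of (x_u, neighbourhood_size)
    model: list of (x_v, sigma_v, neighbourhood_size)
    """
    count = 0
    for x_u, size_u in data:
        for x_v, sigma_v, size_v in model:
            delta = abs(x_u - x_v) / sigma_v      # standardised distance
            if feasible(size_u, size_v, max_diff) and delta <= a:
                count += 1
    return count / (len(data) * len(model))

data = [(10.0, 3), (20.0, 4)]
model = [(11.0, 2.0, 3), (19.0, 2.0, 4), (30.0, 2.0, 9)]
t_small = matching_tolerance(data, model, a=1.0, max_diff=2)
t_large = matching_tolerance(data, model, a=10.0, max_diff=2)
print(t_small, t_large)
```

In this toy example the tolerance cannot exceed 4/6 however large $a$ becomes, because two of the six pairs fail the feasibility test; this mirrors the plateau described above.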



[Figure 1: plot of matching tolerance against the ambiguity parameter. Legend: Graph 1 (20 nodes), Graph 2 (30 nodes), Graph 3 (40 nodes), Graph 4 (50 nodes), Graph 5 (30 nodes), Graph 6 (30 nodes), Graph 7 (30 nodes).]

Fig. 1. Matching Tolerance as a Function of the Ambiguity Parameter.



3 Genetic Algorithms 

The feasibility of using a genetic algorithm as an optimisation framework for
graph matching in general was established by Cross, Wilson and Hancock.
That work showed that the algorithm outperformed other optimisation
techniques such as gradient ascent and simulated annealing. This section
complements the previous work, adapting it with a view to enhancing solution yield.
The algorithm used in this study is a variant of the simple genetic algorithm.
Rather than restrict mating to the fittest individuals, it is allowed to occur at
random. This models more closely the panmictic populations observed in nature.
After reproduction, both parents and offspring are subject to selection to
produce the parent population for the next generation. This process is designed to
better exploit the diversity in the population. The algorithm is augmented with
a gradient ascent step: optimisation of the MAP criterion in equation 4. Such a
step would seem ideal for labelling problems since gradient ascent is a standard
technique for solving them in an optimisation framework. The
tendency of gradient ascent to get stuck in local optima should be mitigated by the
global optimisation properties of the algorithm.

Equation 3 can be used directly as the fitness function. The
constant of proportionality depends on the joint measurement density, $p(A_D, A_M)$,
the unconditional density, $p(x_u, x_v)$, and the matching priors, $P(u, v)$, which
are assumed to have a uniform distribution. Since selection is based on the
ratio of the fitness of a particular individual to the total fitness of the entire
population, this constant need not be explicitly calculated.

A detailed consideration of crossover and population size cannot be given
here due to lack of space. Suffice it to say that uniform crossover was found
to perform best with the hybrid algorithm. Appropriate population sizes can be
determined by considering how good an initial guess has to be before the gradient
ascent step can successfully repair it. It turns out that the MAP update rule of
equation 4 can locate optimal solutions from an initial guess in which only 10%
of the mappings are correct. For a 50-node problem and an initial population size
of 10, the probability that this will be the case is 0.98. So even for moderately
large graphs, relatively small populations should be adequate.
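For readers unfamiliar with the operator, the sketch below shows the textbook form of uniform crossover applied to two candidate labellings, where each individual is a list giving the model label of every data node. The swap probability of 0.5 and the list encoding are assumptions; the paper does not give implementation details.

```python
import random

def uniform_crossover(parent_a, parent_b, rng, swap_prob=0.5):
    """Exchange labels position by position, each with probability swap_prob."""
    child_a, child_b = list(parent_a), list(parent_b)
    for i in range(len(child_a)):
        if rng.random() < swap_prob:
            child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b

rng = random.Random(1)
mother = [0, 1, 2, 3, 4, 5]      # data node i is mapped to model node mother[i]
father = [5, 4, 3, 2, 1, 0]
child_a, child_b = uniform_crossover(mother, father, rng)
print(child_a, child_b)
```

Every position of each child carries the label of one of the two parents, so correct mappings present in either parent can be recombined without any assumption about their position in the string.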

3.1 Selection 

In a standard genetic algorithm, selection is crucial to the algorithm’s search
performance. Whereas mutation, crossover and local search are all “next-state”
operators, selection imposes a stochastic acceptance criterion. The standard
“roulette” selection algorithm, described by Goldberg, assigns each individual
a probability of selection, $p_i$, proportional to its fitness, $F_i$. The genetic algorithm
used here allows the population, $\Psi$, to grow transiently and then selects the next
generation from this expanded population. Denoting the expanded population
by $\Psi_e$, the selection probability of the $i$th individual, $p_i$, is given by

$$p_i = \frac{F_i}{\sum_{j \in \Psi_e} F_j} \qquad (10)$$

The algorithm then holds selection trials for each “slot” in the new population,
for a total of $|\Psi|$ trials. Since selection is with replacement, the constitution
of the new population is governed by the multinomial distribution, and the copy
number of a particular individual, $N(i)$, is distributed binomially:

$$P(N(i) = r) = \binom{|\Psi|}{r}\, p_i^{\,r}\, (1 - p_i)^{|\Psi| - r} \qquad (11)$$

and so the expectation of $N(i)$ is $E[N(i)] = |\Psi|\, p_i$ and its variance is
$\mathrm{Var}[N(i)] = |\Psi|\, p_i (1 - p_i)$.

The search power of the standard genetic algorithm arises from the fact that
if the individual in question is highly fit, $p_i$ will be much larger than the average,
and hence the expectation will be that the copy number will increase. This
approach has two disadvantages. The first is that for small populations, sampling
errors may lead to copy numbers very much higher or lower than the expected
values. This can lead to premature convergence of the algorithm to a local
optimum. Baker proposed “stochastic remainder sampling”, which
guarantees that the copy number will not be much different from the expectation by
stipulating that $\lfloor E[N(i)] \rfloor \le N(i) \le \lceil E[N(i)] \rceil$. However, the larger the
population, the less need there is for Baker’s algorithm. The second disadvantage is
that less fit individuals have lower expectations, and that the lower the fitness,
the lower the variance of the copy number. In other words, less fit individuals
are increasingly likely to have lower copy numbers. When $E[N(i)]$ falls below 1,
the individual will probably disappear from the population. In general, the copy
number variance decreases with decreasing fitness. Only when $p_i > 0.5$ does the
variance decrease with increasing fitness. This occurs when the fitness of one
individual accounts for at least half the total fitness of the population, i.e. when
it is at least $|\Psi_e| - 1$ times as fit as any other individual.
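The behaviour described above can be checked numerically. The sketch below is a hypothetical illustration of roulette selection with replacement: it computes the selection probabilities, the binomial expectation and variance of each copy number, and one random draw. The fitness values and function names are invented for the example.

```python
import random
from collections import Counter

def roulette_probabilities(fitness):
    """Selection probability of each individual, proportional to its fitness."""
    total = sum(fitness)
    return [f / total for f in fitness]

def copy_number_moments(fitness, slots):
    """(expectation, variance) of each copy number over `slots` trials."""
    return [(slots * p, slots * p * (1.0 - p))
            for p in roulette_probabilities(fitness)]

def roulette_select(fitness, slots, rng):
    """Draw `slots` individuals with replacement; return each copy number."""
    picks = rng.choices(range(len(fitness)), weights=fitness, k=slots)
    counts = Counter(picks)
    return [counts[i] for i in range(len(fitness))]

fitness = [8.0, 4.0, 2.0, 1.0, 0.5, 0.5]   # expanded population (made-up values)
slots = 3                                  # size of the new population
moments = copy_number_moments(fitness, slots)
copies = roulette_select(fitness, slots, random.Random(0))
for f, (mean, var) in zip(fitness, moments):
    print(f"fitness {f:4.1f}  E[N] = {mean:.3f}  Var[N] = {var:.3f}")
```

In this example only the fittest individual has an expected copy number above one, and both the expectation and the variance shrink as fitness falls, so the weakest individuals are very likely to be lost.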

In short, the problem with roulette selection is that it imposes too strict an
acceptance criterion on individuals with below average fitness. Several alternative
strategies have been proposed to avoid this problem. “Sigma truncation”, rank
selection and tournament selection all seek to maintain constant selection
pressure by requiring individuals not to compete on the basis of their fitness, but
on some indirect figure of merit such as the rank of their fitness, or the distance
between their fitness and the average in standard units. Taking rank selection
as a typical example of these strategies, the selection probabilities are assigned
by substituting the rank of the individual for its fitness in equation 10, with
the best individual having the highest rank. The implication of this is that the
expected copy numbers of the best and worst individuals are given by:

$$E[N(\text{best})] = \frac{2|\Psi|}{|\Psi_e| + 1}, \qquad E[N(\text{worst})] = \frac{2|\Psi|}{|\Psi_e|\,(|\Psi_e| + 1)} \qquad (12)$$

So, the expected copy number of the fittest individual differs from that of
the least fit by a factor of $|\Psi_e|$. Moreover, if $|\Psi_e|$ is even moderately large,
$E[N(\text{worst})]$ will be much less than 1. Indeed, $E[N(i)]$ will be less than 1 for
about half the population. Thus, under rank selection, less fit individuals are
highly likely to disappear, even if they are quite good.
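As a quick numerical check of this argument, the sketch below substitutes ranks for fitness values in the roulette rule and reports the expected copy numbers. The population sizes and fitness values are arbitrary illustrative choices, and ties in fitness are not handled.

```python
def rank_selection_expectations(fitness, slots):
    """Expected copy numbers when rank (worst = 1) replaces fitness."""
    n = len(fitness)
    worst_first = sorted(range(n), key=lambda i: fitness[i])
    rank_sum = n * (n + 1) / 2
    expected = [0.0] * n
    for rank, i in enumerate(worst_first, start=1):
        expected[i] = slots * rank / rank_sum
    return expected

fitness = [0.1 * (i + 1) for i in range(20)]   # expanded population of 20
expected = rank_selection_expectations(fitness, slots=10)
best, worst = max(expected), min(expected)
print(f"best {best:.3f}, worst {worst:.3f}, ratio {best / worst:.1f}")
```

The ratio between the best and worst expectations equals the size of the expanded population, whatever the actual fitness values are.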

A second alternative to roulette selection is Boltzmann selection. This
strategy borrows the idea from simulated annealing, that at thermal equilibrium
the probability of a system being in a particular state depends on the
temperature and the system’s energy. The idea is that as the temperature is lowered,
high energy (low fitness) states are less likely. The difficulty with this analogy
is that it requires the system to have reached thermal equilibrium. In simulated
annealing, this is achieved after very many updates at a particular
temperature. However, in a genetic algorithm this would require many iterations at each
temperature level to achieve equilibrium, coupled with a slow “cooling”. Within
the 10 or so iterations allowed for hybrid genetic algorithms, equilibrium cannot
even be attained, let alone annealing occur.

It would appear, then, that there is a tradeoff between premature convergence
and the strength of the selection operator. The problem arises from the fact that
expected copy numbers of fit individuals may be greater than one, while those
of unfit individuals may be less than one. One way of preventing the increase in
copy number of highly fit individuals is to use “truncation selection”, as used
in Rechenberg and Schwefel’s evolution strategies. Truncation selection
would simply take the best $|\Psi|$ individuals from the expanded population, $\Psi_e$,
to form the new population. The copy number of each individual is simply 1 or
0, depending on its rank. Although no individual may increase its copy number,
the selection pressure might still be quite severe, since for the algorithm used in
this paper, $|\Psi_e|$ can be as large as $3|\Psi|$. In other words, less fit individuals still
disappear at an early stage. The fact that individuals never increase their copy
number makes this a relatively weak search operator, and probably unsuitable
for a standard genetic algorithm. However, the gradient ascent step is itself a
powerful optimiser, and may be mostly responsible for the optimisation
performance of the algorithm. If this is so, selection would be a much less important
search operator for this hybrid algorithm than it is for standard genetic
algorithms. It may therefore be beneficial to trade search performance for greater
diversity.



Neutral Selection The benefits of stochastic selection can be combined with
the evenness of truncation selection by selecting without replacement. This
strategy can be called “biased selection without replacement”, since it is biased first
in favour of fitter individuals, although it may also favour less fit ones.

The alternative is to abandon fitness-based selection altogether, and rely on
the local search step to do all the optimisation. If the genetic algorithm’s role is
explicitly limited to assembling a good initial guess for the local search operator,
the selection probabilities can be assigned uniformly, i.e. $p_i = 1/|\Psi_e|$. This
operator is called “neutral selection”. Neutral selection without replacement can
be implemented very efficiently by shuffling $\Psi_e$ and choosing the “top” $|\Psi|$
individuals. This strategy shares the advantage with truncation selection, that
the minimum number of individuals are excluded from the new population, but
also maintains the global stochastic acceptance properties of standard selection
operators.
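The three operators that never duplicate an individual can be sketched as follows. Truncation and neutral selection follow the descriptions in the text directly. The sequential fitness-weighted draw used for biased selection without replacement is only one plausible reading, since the paper does not spell out its implementation; all function names are hypothetical.

```python
import random

def truncation_selection(expanded, fitness, size):
    """Keep the `size` fittest individuals of the expanded population."""
    ranked = sorted(range(len(expanded)), key=lambda i: fitness[i], reverse=True)
    return [expanded[i] for i in ranked[:size]]

def neutral_selection(expanded, size, rng):
    """Shuffle the expanded population and keep the first `size` individuals."""
    shuffled = list(expanded)
    rng.shuffle(shuffled)
    return shuffled[:size]

def biased_selection(expanded, fitness, size, rng):
    """Fitness-weighted draws without replacement (assumed implementation)."""
    pool = list(range(len(expanded)))
    chosen = []
    for _ in range(size):
        weights = [fitness[i] for i in pool]
        pick = rng.choices(pool, weights=weights, k=1)[0]
        pool.remove(pick)                 # each individual is taken at most once
        chosen.append(expanded[pick])
    return chosen

expanded = list("ABCDEFGHI")              # nine individuals, fittest first
fitness = [9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
rng = random.Random(42)
top = truncation_selection(expanded, fitness, 3)
neutral = neutral_selection(expanded, 3, rng)
biased = biased_selection(expanded, fitness, 3, rng)
print(top, neutral, biased)
```

All three return a population with copy numbers of 0 or 1; they differ only in how much the choice of survivors depends on fitness.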



Elitism Elitist selection guarantees that at least one copy of the best individual
so far found is selected for the new population. This heuristic is very widely used
in genetic algorithms. Rudolph showed that the algorithm’s eventual
convergence cannot be guaranteed without it. The elitist heuristic can be modified
in two ways to help maintain diversity. First, it seems natural that if the goal
is to simultaneously obtain several solutions to the problem in hand, several of
the fittest individuals should be guaranteed in this way. This is called “multiple


212 R. Myers and E.R. Hancock 



elitism”. Second, if one wishes to avoid losing too many unfit individuals, the
worst individual can also be granted free passage to the new population. This
is called “anti-elitism”. These heuristics, together with the selection strategies
discussed earlier, are evaluated at the end of section 4.
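One way these heuristics might be combined with a selection operator is sketched below. It reserves slots for the guaranteed individuals and fills the remainder by neutral selection without replacement. This slot-reservation scheme, the function name and the parameters `n_elite` and `anti_elite` are assumptions made for illustration, not a description of the authors' implementation.

```python
import random

def select_with_elitism(expanded, fitness, size, rng, n_elite=1, anti_elite=False):
    """Guarantee the n_elite fittest (and optionally the worst) a place,
    then fill the remaining slots uniformly at random without replacement."""
    order = sorted(range(len(expanded)), key=lambda i: fitness[i], reverse=True)
    reserved = order[:n_elite]
    if anti_elite and order[-1] not in reserved:
        reserved.append(order[-1])        # free passage for the worst individual
    if len(reserved) > size:
        raise ValueError("more guaranteed individuals than slots")
    rest = [i for i in order if i not in reserved]
    rng.shuffle(rest)
    chosen = reserved + rest[: size - len(reserved)]
    return [expanded[i] for i in chosen]

expanded = list("ABCDEFGHI")              # fittest first
fitness = [9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
survivors = select_with_elitism(expanded, fitness, 4, random.Random(3),
                                n_elite=2, anti_elite=True)
print(survivors)
```

Setting `n_elite=1` gives single elitism, larger values give multiple elitism, and `n_elite=0` with `anti_elite=False` reduces to plain neutral selection.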



4 Experiments 

This experimental study establishes the suitability of the hybrid genetic
algorithm for ambiguous graph matching, and compares the selection strategies
discussed in the previous section. The algorithm was tested on 30-node
synthetic graphs. The point sets were generated at random, and then triangulated by
connecting each point to six of its nearest neighbours. Data graphs were
generated by randomly perturbing the node attributes, and then duplicating 10% of the
nodes and perturbing their attributes. The intention was to simulate
segmentation errors expected of region extraction, such as the splitting of one region into
two similar ones.
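A rough sketch of this kind of test data generation is shown below. The unit-square point distribution, the attribute range, the Gaussian perturbation and its standard deviation are all assumptions, as the paper does not state them. How duplicated nodes are wired into the data graph is also unspecified, so the sketch only produces their attributes.

```python
import math
import random

def make_model_graph(n, k, rng):
    """Random points, each linked to its k nearest neighbours (undirected)."""
    points = [(rng.random(), rng.random()) for _ in range(n)]
    attrs = [rng.uniform(0.0, 255.0) for _ in range(n)]
    edges = set()
    for i, p in enumerate(points):
        others = [j for j in range(n) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(p, points[j]))[:k]
        for j in nearest:
            edges.add((min(i, j), max(i, j)))
    return points, attrs, edges

def make_data_attributes(attrs, noise_sd, dup_fraction, rng):
    """Perturb every attribute, then add perturbed duplicates of some nodes.

    Returns (source model node, data attribute) pairs; the source index is
    the ground-truth match.
    """
    data = [(v, x + rng.gauss(0.0, noise_sd)) for v, x in enumerate(attrs)]
    n_dup = round(dup_fraction * len(attrs))
    for v in rng.sample(range(len(attrs)), n_dup):
        data.append((v, attrs[v] + rng.gauss(0.0, noise_sd)))
    return data

rng = random.Random(7)
points, attrs, edges = make_model_graph(n=30, k=6, rng=rng)
data = make_data_attributes(attrs, noise_sd=5.0, dup_fraction=0.1, rng=rng)
degree = [sum(1 for e in edges if v in e) for v in range(30)]
print(len(edges), len(data), min(degree))
```

Each duplicated node has two data measurements close to the same model measurement, which is exactly the kind of ambiguity the modified matching criterion is meant to tolerate.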



4.1 Comparative Study 

A comparative study was performed to determine the best algorithm for
ambiguous matching. The algorithms used were the hybrid genetic algorithm with
and without mutation, crossover or both (hGA, hGA-m, hGA-x and hGA-xm)¹,
a hybrid version of Eshelman’s CHC algorithm (hCHC), and plain gradient
ascent (HC). The experimental conditions are summarised in table 2.



Table 2. Algorithms for Graph Matching. Each algorithm, apart from HC, made
approximately 700,000 fitness evaluations. Abbreviations: hGA = hybrid genetic
algorithm, hGA-m = hGA without mutation, hGA-x = hGA without crossover, hGA-xm
= hGA with neither mutation nor crossover, hCHC = hybrid CHC, and HC = gradient
ascent (hillclimbing).

             hGA      hGA-m    hGA-x    hGA-xm   hCHC   HC
Population   50       50       120      120      100    1
Iterations   5        5        5        5        5      10
Crossover    Uniform  Uniform  Uniform  Uniform  HUX    n/a
Cross rate   0.9      0.9      0.0      0.0      1.0    n/a
Mutate rate  0.3      0.0      0.3      0.0      0.35   n/a



¹ These should be regarded as different algorithms, not merely different parameter sets
for a genetic algorithm, because a genetic algorithm with no crossover or mutation
is fundamentally different from one which has these operators. For example, the
hGA-xm algorithm is really just multiple restarts of gradient ascent with a selection
step.

Each of the algorithms listed in table 2, except HC, was run 100 times. Since
HC is deterministic, it was only run once per graph. The results for the different
graphs were pooled to give 400 observations per algorithm. Algorithm
performance was assessed according to two criteria. The first was the average fraction
of correct mappings in the final population. The second was the proportion of
distinct individuals in the final population with more than 95% correct mappings.
The results are reported in table 3.



Table 3. Graph Matching Results. Standard errors are given in parentheses.
Abbreviations: hGA = hybrid genetic algorithm, hGA-m = hGA without mutation, hGA-x =
hGA without crossover, hGA-xm = hGA with neither mutation nor crossover, hCHC
= hybrid CHC, and HC = gradient ascent (hillclimbing).

Algorithm  Average Fraction Correct  Average Fraction Distinct
hGA        0.90 (0.0044)             0.078 (0.0019)
hGA-m      0.88 (0.0051)             0.040 (0.0012)
hGA-x      0.84 (0.0052)             0.044 (0.00094)
hGA-xm     0.76 (0.0068)             0.013 (0.00036)
hCHC       0.92 (0.0042)             0.012 (0.00033)
HC         0.97 (n/a)                n/a



At first sight, pure gradient ascent appears to outperform all the other
algorithms. The reason for this is partly that the gradient ascent algorithm starts
from an initial guess in which about 50% of the mappings are correct,
whereas the other algorithms start with random initial guesses. More importantly,
the final population of a genetic algorithm typically contains solutions much
better and worse than the average. Thus, this comparison is not really fair: a
fairer comparison of optimisation performance comes from considering hGA-xm,
which is multiple random restarts of gradient ascent. Furthermore, gradient
ascent is deterministic, and therefore always gives the same result, but the genetic
algorithm is stochastic and may do significantly better or worse than gradient
ascent. Indeed, the genetic algorithm occasionally found matches with 100%
correct mappings. However, the performance of gradient ascent alone suggests that
for unambiguous problems, genetic algorithms may not necessarily be the
method of choice. Apart from pure gradient ascent, the best optimiser was hCHC,
which is only slightly better than hGA. The results for hGA-m and hGA-x
indicate that crossover and mutation are playing an active part in the optimisation
process. Turning to the fraction of distinct individuals with over 95% correct
mappings, it is clear that pure gradient ascent is incapable of finding more than
one solution. The hCHC algorithm appears to converge to fewer solutions than
the hGA algorithm. In all, the hybrid genetic algorithm (hGA) combines strong
optimisation performance with the highest solution yield, and it is this algorithm
which will be the subject of the remainder of this study.



4.2 Selection

Two sets of experiments were conducted to evaluate different selection strategies 
with and without elitism. In each case, a hybrid genetic algorithm was used, with 
a population size of 20, and uniform crossover was used at a rate of 1.0. The 
mutation rate was fixed at 0.4. The first set of experiments used 20, 30, 40 and 50 
node graphs, and for these the population size was set to 10, and the algorithm 
run for 5 iterations. The second set of experiments used four 30 node graphs, 
with a population size of 20 and 10 iterations. Five different selection strategies 
were compared: they were standard roulette, rank, and truncation selection, and 
neutral and biased selection without replacement. Five combinations of elitist 
heuristics were considered: they were no elitism, single elitism, multiple elitism, 
anti-elitism, and a combination of multiple and anti-elitism. The experimental 
design was therefore a 5x5x4 factorial with 100 cells. The first set of experiments 
had 40 replications for a total of 4000 observations; and the second set had 50 
replications for 5000 observations. Figures 2 and 3 summarise the results.

Both plots show that neutral selection without replacement produced the 
best yields, and that truncation selection produced the worst. Biased and rou- 
lette selection strategies gave similar results, and were both outperformed by 
rank selection. Linear logistic regression analysis of both data sets confirmed 
this ranking of selection strategies. The results for elitism heuristic were not so 
convincing. It is questionable whether elitism has any overall effect: the regres- 
sion analysis of the second data set found no significant effect of varying the 
elitism strategy. The analysis of the first data set did show that either standard 
(single) or multiple elitism gave significantly better yields, but that the effect 
was small. 




Fig. 2. Average Yields versus Selection and Elitism I. Data from all four graphs has 
been pooled. 



Least Committment Graph Matching by Evolutionary Optimisation 







Fig. 3. Average Yields versus Selection and Elitism II. Data from all four graphs has 
been pooled. 



4.3 Real Images 

The problem motivating this paper is the registration of point sets derived from
image features. Unlike Wilson and Hancock’s point sets, the problems considered
here do not generally produce unambiguous measurements. Take for example
the case of the stereogram in figure 4. Regions were extracted from the greyscale
image pair (panels (a) and (b), an office scene taken with an IndyCam) using
a simple thresholding technique. Each image contained 50 regions. The region
centroids were Delaunay triangulated using Triangle [42]. The average grey level
over each region was used for the attribute information, shown in panel (c). The 
overlap between left and right image measurements indicates that the images 
are matchable. However, the overlap between left and left, and right and right, 
image attributes suggests that unambiguous assignments will not be possible. 
The Delaunay triangulations were matched using a hybrid genetic algorithm 
with neutral selection. The population size was set to 5, and 5 iterations were 
allowed. The crossover and mutation rates were 1.0 and 0.5 respectively. Panel 
(d) shows an initial guess in which none of the mappings is correct. Panels (e) 
to (g) show the three distinct solutions found. There were 50 regions in the left 
image of which 42 had feasible correspondences in the right. The amount of 
relational corruption between the two triangulations was estimated at around 
35% by counting the number of inconsistent supercliques given the ground truth 
match. Despite the significant relational corruption, the three solutions had 98%, 
93% and 95% correct mappings. 

[Figure 4 panels: (a) Left Image; (b) Right Image; (c) Average Region Intensities; (d) Initial Guess (0%); (e) Final Match (98%); (f) Final Match (93%); (g) Final Match (95%).]

Fig. 4. Uncalibrated Stereogram 1. The camera positions are not known.



5 Conclusion 

This paper has presented a method of matching ambiguous feature sets with a
hybrid genetic algorithm, which does not require any additional parameters to
achieve multimodal optimisation. This allows the principle of least commitment
to be applied to such problems. The first contribution made was to adapt
the Bayesian matching framework, due to Wilson and Hancock, to handle
ambiguous feature measurements. Rather than finding the most probable
assignment, the new framework appeals directly to the underlying measurement
distributions to classify data features as similar or dissimilar to model features,
and to assign probabilities to those classes.

If most of the optimisation is undertaken in the gradient ascent step, the 
tradeoff between effective search and maintenance of diversity, which must be 
made in choosing a selection operator for standard genetic algorithms, can be 
abandoned. Neutral selection without replacement maximises the diversity in 
the next generation with no regard to individuals’ fitnesses. This operator was 
shown in section 4.2 to provide the highest solution yields.

There are a number of interesting directions in which this work could be
taken. First, the ambiguity parameter or matching tolerance, defined in section
2.1, must at present be estimated graphically by considering what proportion
of features are statistically similar. It should be possible to infer suitable
values of this parameter more robustly, given a set of training examples. A final
observation is that in figure 4 the variations in the solutions tend to concern
assignments which are incorrect. This raises the possibility of directing a vision
system’s focus of attention to those parts of a scene which are most problematic.



Abstract. New applications in fields such as augmented or virtualized
reality have created a demand for dense, accurate real-time stereo recon- 
struction. Our goal is to reconstruct a user and her office environment 
for networked tele-immersion, which requires accurate depth values in a 
relatively large workspace. In order to cope with the combinatorics of 
stereo correspondence we can exploit the temporal coherence of image 
sequences by using coarse optical flow estimates to bound disparity se- 
arch ranges at the next iteration. We use a simple flood fill segmentation 
method to cluster similar disparity values into overlapping windows and 
predict their motion over time using a single optical flow calculation per 
window. We assume that a contiguous region of disparity represents a 
single smooth surface which allows us to restrict our search to a narrow 
disparity range. The values in the range may vary over time as objects 
move nearer or farther away in Z, but we can limit the number of dispa- 
rities to a feasible search size per window. Further, the disparity search 
and optical flow calculation are independent for each window, and allow 
natural distribution over a multi-processor architecture. 

We have examined the relative complexity of stereo correspondence on 
full images versus our proposed window system and found that, depen- 
ding on the number of frames in time used to estimate optical flow, 
the window-based system requires about half the time of standard cor- 
relation stereo. Experimental comparison to full image correspondence 
search shows our window-based reconstructions compare favourably to 
those generated by the full algorithm, even after several frames of pro- 
pagation via estimated optical flow. The result is a system twice as fast 
as conventional dense correspondence without significant degradation of 
extracted depth values. 



1 Introduction 

The difficulty of creating and rendering the geometry of virtual worlds by hand 
has led to considerable work on using images of the real world to construct
realistic virtual environments. As a result the ability to reconstruct or
virtualize environments in real-time has become an important consideration in
work on stereo reconstruction. Unfortunately, to accurately reconstruct
reasonably large volumes, we must search large ranges of disparity for
correspondences. To achieve both large, accurate reconstructions and real-time
performance we have to exploit every option to reduce the amount of calculation
required, while maintaining the best accuracy possible to allow interaction
with live users.

One possible avenue to improve the temporal performance of reconstruction 
on an image sequence is to take advantage of temporal coherence. Since the same 
objects tend to be visible from frame to frame we can use knowledge from earlier 
frames when processing new ones. There is however a very real and complicated 
tradeoff between added calculation for exploiting temporal coherence, and its 
advantages in simplifying the stereo correspondence problem. 

In this paper we propose a segmentation of the image based on an initial 
calculation of the full disparity map. We use a simple flood fill method to extract 
windows containing points in a narrow disparity range, and then use a local linear 
differential technique to calculate a single optical flow value for each window. 
The flow value allows us to predict the location of the disparity windows in a 
future frame, where we need only consider the updated disparity range for each 
window. Essentially we are making the assumption that a contiguous region 
with similar disparity will belong to a single surface and will thus exhibit similar 
motion (generated flow), to simplify our calculations and speed up our algorithm. 

Several real-time stereo systems have become available in recent years
including the Triclops vision system by Point Grey Research (www.ptgrey.com).
Triclops uses three strongly calibrated cameras and rectification, Pentium/MMX 
processors and Matrox Meteor II/MC frame grabbers. Its reported performance 
is about 5 dense disparity frames per second (fps) for a 320 x 240 image and 
32 disparities. The system’s performance degrades however when additional ac- 
curacy options such as subpixel interpolation of disparities are required. The 
SRI Small Vision System achieves 12 fps on Pentium II, for similar image size 
and disparity range, apparently by careful use of system cache while performing 
correlation operations. The CMU Video Rate Stereo Machine uses special 
purpose hardware including a parallel array of convolvers and a network of 8 TI 
C40’s to achieve rates of 15 fps on 200 x 200 pixel images for 30 disparities. In 
the image pipeline however, calculations are performed on 4-5 bit integers, and 
the authors offer no specific analysis of how this affects the accuracy and reso- 
lution of matching available. Neither Triclops nor the CMU machine yet exploit 
temporal coherence in the loop to improve the efficiency of their calculations or 
the accuracy of their results. 

Sarnoff’s Visual Front End (VFE) is also a special purpose parallel hardware
pipeline for image processing, specifically designed for autonomous vehicles. 
It can perform some low level functions such as image pyramid construction, 
image registration and correlation at frame rates or better. The VFE-200 is 
reported to perform optical flow and stereo calculations at 30 fps. The tasks 
described are horopter-based stereo obstacle detection applications, which 
use knowledge of task and environment to reduce complexity. Registration to a 
ground plane horopter allows the correspondence search to be limited to a band 
around this plane. 

The complementary nature of optical flow and stereo calculations is well 
known. Stereo correspondence suffers from the combinatorics of searching across 
a range of disparities, and from occlusion and surface discontinuities. Structure 
from motion calculations are generally second order and sensitive to noise, as 
well as being unable to resolve a scale factor. Exploiting the temporal coherence 
of depth and flow measurements can take two forms: it can be used to improve 
the quality or accuracy of computed values of depth and 3D motion or, 
as in our case, can be used as a means of optimizing computations to achieve
real-time performance. Obviously approaches which compute accurate 3D models, 
using iterative approaches such as linear programming are unlikely to be useful 
for real-time applications such as ours. Other proposed methods for autonomous 
robots restrict or otherwise depend on relative motion which cannot be
controlled for a freely moving human subject. Tucakov and Lowe [22] proposed 
exploiting uncertain odometry knowledge of the camera’s motion in a static 
environment to reduce the disparity range at each pixel. Knowledge of the full 
(though inaccurate) 3D motion is a stronger constraint than the coarse optical 
flow values for unknown rapid motion we use. 



Fig. 1. Tele-cubicle camera configuration (panel labels: Remote/Virtual
Office #1, Remote/Virtual Office #2).



Our interest in real-time reconstruction stems from the goal of generating 
an immersive office collaboration environment. Tele-immersion is a form of net- 
worked virtual or augmented reality which is intended to push the bandwidth of 
Internet2. It entails stereo reconstruction of people and objects at remote sites, 
Internet transmission, and rendering of the resulting 3-D representations. When 
projected on special 3D displays these models will generate an immersive
experience in the local office environment. The goal is to seem to be sitting across 
the desk from collaborators who may be half a world away. 

For our stereo algorithm this prescribes a number of strong performance
requirements. The reconstructions must be dense in order to allow for meshing
and rendering, and they must offer fine resolution in depth to be able to
distinguish features on the human face and hands. The method must operate in
real-time on a sufficiently large image (320 x 240) to provide this level of
density and resolution. Resolution in depth naturally depends on resolution in
disparities. Figure 2 illustrates a camera pair from the tele-cubicle in
Figure 1. For a workspace depth w = 1 m, the disparity ranges from d = -185
pixels at point A, 35 cm from the cameras, to d = 50 pixels at B, 135 cm from
the cameras. Clearly a disparity range of 50 - (-185) = 235 is prohibitive for
an exhaustive correspondence search. Finally the human subject is of course
non-rigid and non-planar, and can move at high velocity inside the workspace. 

The current tele-cubicle configuration includes a pair of strongly calibrated 
monochrome cameras and a colour camera for texture mapping. A correlation 
stereo algorithm running on a Pentium II determines disparities, from which a 
depth map can be calculated for triangulation and rendering. Finally this trian- 
gulation is transmitted via TCP/IP to a 3D display server. The next generation 
of the tele-cubicle, illustrated in plan view in Figure 1, will have a semi-circular 
array of 7 colour cameras attached by threes to 5 quad processor Pentium IIIs. 
Despite this increase in processing power, we need to exploit any method which 
will improve our real-time response while not sacrificing the accuracy or work- 
volume of the tele-cubicle. 

In the following sections we will describe the disparity window prediction 
techniques we propose to enhance the real-time capability of our stereo recon- 
structor. We describe a number of experiments which look at using only the 
previous disparity and flow to predict disparities in the next frame, as well as 
using predicted window locations to search shorter disparity ranges in the next 
image in a sequence. We examine the accuracy of the methods relative to our exi- 
sting full image algorithm as well as the quality of window based reconstructions 
over several frames. 

2 Predicting Disparity Windows 

Our method for integrating disparity segmentation and optical flow can be sum- 
marized in the following steps: 



Step 1: Bootstrap by calculating a full disparity map for the first stereo
pair of the sequence.

Step 2: Use flood-fill to segment the disparity map into rectangular
windows containing a narrow range of disparities.

Step 3: Calculate optical flow per window for left and right smoothed,
rectified image sequences of intervening frames.

Step 4: Adjust disparity window positions and disparity ranges according
to estimated flow.

Step 5: Search windows for correspondence using the assigned disparity
range, selecting the 'best' correlation value over all windows and disparities
associated with each pixel location.

Step 6: Go to Step 2.

Fig. 2. Verged stereo pair configuration for work volume w.

In the following sections we will discuss in some detail our existing full frame 
correlation stereo algorithm, our flood fill segmentation technique, optical flow 
approximation, window update and regional correspondence techniques. 




Fig. 3. Frames 10, 12 and 18 from the left image sequence. The subject translates and 
rotates from right to left in the image. 



Correlation Stereo. In order to use stereo depth maps for tele-immersion or other 
interactive virtual worlds, they must be accurate and they must be updated 
quickly, as people or objects move about the environment. To date our work has 
focused on the accuracy of our reconstructions. 

The stereo algorithm we use is a classic area-based correlation approach. 
These methods compute dense 3-D information, which allows extraction of higher 
order surface descriptions. In general our stereo system operates on a static set 
of cameras which are fixed and strongly calibrated. Originally the system was 
created to generate highly precise surface reconstructions. 

Our full image implementation of the reconstruction algorithm begins by 
grabbing images from 2 strongly calibrated monochrome cameras. The system 
rectifies the images so that their epipolar lines lie along the horizontal image 
rows to reduce the search space for correspondences and so that corresponding 
points lie on the same image lines. Computing depth values from stereo images 
of course requires finding correspondences, and our system measures the degree 
of correspondence by a modified normalized cross-correlation (MNCC), 

$$c(I_L, I_R) = \frac{2\,\mathrm{cov}(I_L, I_R)}{\sigma^2(I_L) + \sigma^2(I_R)}, \tag{1}$$

where $I_L$ and $I_R$ are the left and right rectified images over the selected
correlation windows. For each pixel $(u, v)$ in the left image, the matching
produces a correlation profile $c(u, v, d)$ where disparity $d$ ranges over
acceptable integer values. 
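
As an illustration only, the MNCC measure and the per-pixel correlation profile can be sketched in a few lines of NumPy. The function names `mncc` and `correlation_profile`, the square window of half-width `half`, and the skipping of disparities that fall outside the image are assumptions made for this sketch, not details of the system described here. The sign convention x_r = x_l - d follows the one used later in the text.

```python
import numpy as np

def mncc(patch_l, patch_r):
    """Modified normalized cross-correlation: 2*cov / (var_l + var_r)."""
    a = np.asarray(patch_l, dtype=float)
    b = np.asarray(patch_r, dtype=float)
    cov = np.mean(a * b) - np.mean(a) * np.mean(b)
    denom = np.var(a) + np.var(b)
    # Flat patches carry no texture; report zero correlation (assumption).
    return 0.0 if denom == 0 else 2.0 * cov / denom

def correlation_profile(left, right, u, v, half, d_min, d_max):
    """Return {d: c(u, v, d)} for integer disparities d in [d_min, d_max]."""
    win_l = left[v - half:v + half + 1, u - half:u + half + 1]
    profile = {}
    for d in range(d_min, d_max + 1):
        ur = u - d  # column of the candidate match in the right image
        if ur - half < 0 or ur + half + 1 > right.shape[1]:
            continue  # candidate window leaves the image
        win_r = right[v - half:v + half + 1, ur - half:ur + half + 1]
        profile[d] = mncc(win_l, win_r)
    return profile
```

The peak of the returned profile is the candidate integer disparity; the peak selection constraints described next are not part of this sketch.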

All peaks in the correlation profile satisfying a relative neighbourhood thres- 
hold, are collected in a disparity volume structure. Peaks designated as matches 
are selected using visibility, ordering and disparity gradient constraints. A sim- 
ple interpolation is applied to the values in the resulting integer disparity map 
to obtain subpixel disparities. From the image location, disparity and camera 
calibration parameters we can now compute a dense map of 3-D depth points 
based on our matches. We can also easily colour these depth points with the 
greyscale (or colour) values at the same locations in the left rectified image. 




Fig. 4. Disparity map (a) and extracted windows of similar disparity (b) (frame 12). 

Flood-fill Segmentation. It is more common to use flow fields to provide coarse 
segmentation than to use similar disparity, but our existing stereo system 
provides dense disparity maps, whereas most fast optical flow techniques provide 
relatively sparse flow values. Restricting the change in disparity per window 
essentially divides the underlying surfaces into patches where depth is nearly 
constant. The image of a curved surface for example will be broken into a number 
of adjacent windows, as will a flat surface angled steeply away from the cameras. 
Essentially these windows are small quasi-frontal planar patches on the surface. 
Rather than apply a fixed bound to the range of disparities as we do, one could 
use a more natural constraint such as a disparity gradient limit. However, this
tends to create windows with large disparity ranges when smooth slanted surfaces
are present in the scene, which is what we are trying to avoid. 

Any efficient region growing method could be applied to cluster the dispa- 
rities into regions. Since our constraint is a threshold, and we allow regions to 
overlap we have chosen to use flood fill or seed fill HH pp. 137-141], a simple poly- 
gon filling algorithm from computer graphics. We have implemented a scan-line 
version which pops a seed pixel location inside a polygon to be filled, then finds 
the right and left connected boundary pixels on the current scan line, ‘filling’ 
those pixels between. Pixels in the same x-range in the lines above and below 
are then examined. The rightmost pixel in any unfilled, non-boundary span on 
these lines in this range is pushed on the seed stack and the loop is repeated. 
When the stack is empty the polygon is filled. 

We have modified this process slightly so that the boundary is defined by whether 
the current pixel/disparity value falls within a threshold (±5) of the first 
seeded pixel. We start with a mask of valid disparity locations in the disparity 
image. For our purposes filling is marking locations in the mask which have been 
included in some disparity region, and updating the upper left and lower right 
pixel coordinates of the current window bounding box. When there are no more 
pixels adjacent to the current region which fall within the disparity range of the 
original seed, the next unfilled pixel from the mask is used to seed a new window. 
Once all of the pixel locations in the mask are set the segmentation is complete. 
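
The segmentation just described might be sketched as follows. This is a minimal illustration, not the implementation used in the paper: it replaces the scan-line fill with a plain 4-connected stack-based fill for brevity, and the function name `segment_disparity`, the dictionary layout of each window and the parameter `tol` (the ±5 threshold) are assumptions of the sketch.

```python
import numpy as np

def segment_disparity(disp, valid, tol=5):
    """Flood-fill a disparity map into regions within +/- tol of their seed.

    Returns one dict per region holding its bounding box (x_ul, y_ul,
    x_lr, y_lr), its disparity range and its pixel count.
    """
    disp = np.asarray(disp, dtype=float)
    valid = np.asarray(valid, dtype=bool)
    h, w = disp.shape
    filled = np.zeros((h, w), dtype=bool)  # pixels already in some region
    windows = []
    for sy in range(h):
        for sx in range(w):
            if not valid[sy, sx] or filled[sy, sx]:
                continue
            seed = disp[sy, sx]  # next unfilled pixel seeds a new window
            stack = [(sx, sy)]
            filled[sy, sx] = True
            x0 = x1 = sx
            y0 = y1 = sy
            d_lo = d_hi = seed
            count = 0
            while stack:
                x, y = stack.pop()
                count += 1
                # Grow the bounding box and disparity range of the region.
                x0, x1 = min(x0, x), max(x1, x)
                y0, y1 = min(y0, y), max(y1, y)
                d_lo, d_hi = min(d_lo, disp[y, x]), max(d_hi, disp[y, x])
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < w and 0 <= ny < h and valid[ny, nx]
                            and not filled[ny, nx]
                            and abs(disp[ny, nx] - seed) <= tol):
                        filled[ny, nx] = True
                        stack.append((nx, ny))
            windows.append({"box": (x0, y0, x1, y1),
                            "d_range": (float(d_lo), float(d_hi)),
                            "size": count})
    return windows
```

Each pixel joins exactly one region, but the rectangular bounding boxes of different regions may overlap, which matches the behaviour described in the following paragraph.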

The disparity map for pair 12 of our test image sequence (Figure 3) is
illustrated in Figure 4, along with the disparity windows extracted by the flood-fill 
segmentation. Twenty-nine regions were extracted, with mean disparity range 
width of 14 pixels. We maintain only rectangular image windows rather than 
a convex hull or more complicated structure, because it is generally faster to 
apply operations to a larger rectangular window than to manage a more com- 
plicated region structure. A window can cover pixels which are not connected 
to the current region being filled (for example a rectangular bounding box for 
an ‘L’-shaped region will cover many pixels that are not explicitly in the
disparity range) and therefore the windows extracted overlap. This is an advantage 
when change in disparity signals a depth discontinuity, because if a previously 
occluded region becomes visible from behind another surface, the region will be 
tested for both disparity ranges. 

As a final step small regions (< MIN-REG pixels) are attributed to noise and
deleted. Nearby or overlapping windows are merged when the corner locations
bounding window $W_i$, expanded by a threshold NEAR-WIN, fall within window
$W_j$, and the difference between the region mean disparities satisfies:

$$\left| \frac{\sum_{R_i} D(x_k, y_k)}{N_i} - \frac{\sum_{R_j} D(x_k, y_k)}{N_j} \right| < \text{NEAR-DISP},$$

where $R_i$ and $R_j$ are the sets of pixels in two disparity regions, with
$N_i$ and $N_j$ elements respectively. In Section 3.2 we examine the sensitivity
of window properties (number, size and disparity range) to variation in the
NEAR-WIN and NEAR-DISP thresholds. 



Flow per Window. Optical flow calculations approximate the motion field of
objects moving relative to the cameras, based on the familiar image brightness
constancy equation: $I_x v_x + I_y v_y + I_t = 0$, where $I$ is the image
brightness, $I_x$, $I_y$ and $I_t$ are the partial derivatives of $I$ with
respect to $x$, $y$ and $t$, and $v = [v_x, v_y]$ is the image velocity. We use
a standard local weighted least squares algorithm to calculate values for $v$
based on minimizing

$$e = \sum_{W_i} (I_x v_x + I_y v_y + I_t)^2$$

for the pixels in the current window $W_i$. We do not apply an affine flow
assumption because of the increased complexity of solving for 6 parameters
rather than just two components of image velocity.

Fig. 5. Flow fields computed for (a) full image and (b) segmented windows (left frames 
10-16), (c) histogram of absolute angular flow differences (degrees). 



For each disparity window we assume the motion field is constant across 
the region $W_i$, and calculate a single value for the centre pixel. We use weights 
$w(x, y)$ to reduce the contribution of pixels farthest from this centre location. 



w{x,y) = 






"'y 

^^2 2 






where $N_i = n_{xi} \times n_{yi}$ are the current window dimensions. 

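We construct our linear system for $I_x v_x + I_y v_y = -I_t$ as follows:

$$A = \begin{bmatrix} w(x_1, y_1) I_x(x_1, y_1) & w(x_1, y_1) I_y(x_1, y_1) \\ \vdots & \vdots \\ w(x_{N_i}, y_{N_i}) I_x(x_{N_i}, y_{N_i}) & w(x_{N_i}, y_{N_i}) I_y(x_{N_i}, y_{N_i}) \end{bmatrix}, \qquad b = -\begin{bmatrix} w(x_1, y_1) I_t(x_1, y_1) \\ \vdots \\ w(x_{N_i}, y_{N_i}) I_t(x_{N_i}, y_{N_i}) \end{bmatrix},$$

where locations $(x_1, y_1) \ldots (x_{N_i}, y_{N_i})$ are the pixels contained
in window $W_i$. We can then calculate the least squares solution of
$Av - b = 0$ using one of several forms of factorization.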
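
A minimal sketch of the per-window solve, assuming the derivative images for the window are already available as arrays. The function name `window_flow` is hypothetical, the weights are supplied by the caller (uniform weights are used by default, which is an assumption of this sketch), and NumPy's SVD-based `np.linalg.lstsq` stands in for whichever factorization is actually used.

```python
import numpy as np

def window_flow(Ix, Iy, It, weights=None):
    """Least-squares flow [v_x, v_y] for one window from its derivatives."""
    Ix, Iy, It = (np.asarray(a, dtype=float).ravel() for a in (Ix, Iy, It))
    if weights is None:
        w = np.ones_like(Ix)  # assumption: uniform weighting by default
    else:
        w = np.asarray(weights, dtype=float).ravel()
    # One row per pixel: [w*I_x, w*I_y] v = -w*I_t
    A = np.column_stack((w * Ix, w * Iy))
    b = -(w * It)
    v, *_ = np.linalg.lstsq(A, b, rcond=None)
    return v
```

Because only two unknowns are estimated per window, the cost of the solve grows linearly with the number of pixels in the window.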

Only one optical flow value is estimated per window. Figure 5 shows the 
comparison between flow estimates for 5x5 windows across the full image and 
values computed for our segmented windows (depicted by the same vector at 
each window location) for the left image sequence frames 10-16. The figure also 
includes a histogram of the absolute angle between the flow vectors at each 
point where a valid flow exists for both the full frame and window flows. The 
angle difference is clustered around zero degrees so that the dominant flow in the 
region is on average reasonably represented. A better estimate might be achieved 
by maintaining the centroid and standard deviation in x and y of the pixels 
included in a region by the segmentation. Computing optical flow on a window 
Wfi = {Cxi — <Xxi, Cyi — cfyi, Cxi + C yi -fi G yi) wouM focus thc calculation on 

the region nominally extracted. 
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Window Flow Adjustment. For each window represented by its upper left and 
lower right corner locations W{t) = [(xui^yui), (xir,yir)], we must now adjust 
its location according to our estimated flow for the right and left images vi = 

and Vj. [^xri^yr]- 

The window corner adjustment for dt, the time 
since the last disparity, basically takes TT(t)’s upper 
left and lower right window coordinates and adds vidt 
and Vrdt to each in turn. The upper left corner is 
updated to the minimum x and y coordinates from 
Wui{t), Wui{t) + vidt and Wui{t) + Vrdt. Similarly the 
lower right coordinate is updated to the maximum 
X and y coordinates from Wir{t), + vidt and 

Wir{t) + Vrdt. This process is illustrated in Figure 0 
Basically we force the window to expand rather than 
actually moving it, if the left coordinate of the win- pjg. g. Update expands 
dow is predicted to move up or left by the right or window in flow direc- 
left flow, then the window is enlarged to the left. If tions. 
the right coordinate is predicted to move down or right the window is enlarged 
accordingly. Checks are also made to ensure the window falls within the image 
bounds ( 1 , maxc) , ( 1 , maxr ) . 

Since the windows have moved as a consequence of objects moving in depth,
we must also adjust the disparity range $D(t) = [d_{min}, d_{max}]$ for each
window:

$$D(t + dt) = [\min(d_{min} + v_{xl}\,dt - v_{xr}\,dt,\; d_{min}),\; \max(d_{max} + v_{xl}\,dt - v_{xr}\,dt,\; d_{max})].$$
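
The two update rules above can be sketched together as follows. The function name `update_window` and the tuple layouts are hypothetical, and the sketch leaves the corner coordinates unrounded; a real implementation would presumably round them outward to whole pixels.

```python
def update_window(win, d_range, v_l, v_r, dt, max_c, max_r):
    """Expand a window and its disparity range by the left/right flow.

    win     -- ((x_ul, y_ul), (x_lr, y_lr)), 1-based image coordinates
    d_range -- (d_min, d_max)
    v_l/v_r -- (v_x, v_y) flow estimates for the left and right images
    """
    (x_ul, y_ul), (x_lr, y_lr) = win
    d_min, d_max = d_range
    # Candidate displacements: no motion, left-image flow, right-image flow.
    shifts = [(0, 0), (v_l[0] * dt, v_l[1] * dt), (v_r[0] * dt, v_r[1] * dt)]
    # The window only ever grows: min for the upper left, max for lower right,
    # clamped to the image bounds (1, max_c) and (1, max_r).
    new_ul = (max(1, min(x_ul + sx for sx, _ in shifts)),
              max(1, min(y_ul + sy for _, sy in shifts)))
    new_lr = (min(max_c, max(x_lr + sx for sx, _ in shifts)),
              min(max_r, max(y_lr + sy for _, sy in shifts)))
    # Predicted change in disparity from the horizontal flow difference.
    dd = (v_l[0] - v_r[0]) * dt
    new_range = (min(d_min + dd, d_min), max(d_max + dd, d_max))
    return (new_ul, new_lr), new_range
```

Since neither the window nor the disparity range can shrink in this scheme, repeated propagation without re-segmentation gradually enlarges the search.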



Windowed Correspondence. Window based correspondence proceeds much as
described for the full image, except for the necessary manipulation of windows.
Calculation of MNCC using Equation 1 allows overall calculation of the terms
$\sigma^2(I_L)$ and $\sigma^2(I_R)$, and $\mu(I_L)$ and $\mu(I_R)$, on a once
per image pair basis. For $\mathrm{cov}(I_L, I_R) = \mu(I_L I_R) - \mu(I_L)\mu(I_R)$
however, $\mu(I_L I_R)$ and the product $\mu(I_L)\mu(I_R)$ must be recalculated
for each disparity tested. In the case of our disparity windows, each window can
be of arbitrary size, but will have relatively few disparities to check. Because
our images are rectified to align the epipolar lines with the scanlines, the
windows will have the same y coordinates in the right and left images. Given the
disparity range we can extract the desired window from the right image given
$x_r = x_l - d$. Correlation matching and assigning valid matches to the
disparity volume proceeds as described for the full image method. 

The general method of extracting and tracking disparity windows using optical
flow does not depend on the specific correlation methods described above. Most
real-time stereo algorithms do use some form of dense correlation matching, and
these will benefit as long as the expense of propagating the windows via optical
flow calculations is less than the resulting savings over the full image/full
disparity match calculation. 

3 Results 

3.1 Complexity Issues 

The first question we need to ask is whether the calculation of window based
correlation and optical flow is actually less expensive in practice than the
full image correlation search over $(d_{min}, d_{max})$. For images of size
$(n_x \times n_y)$ let us consider the operation of convolving with a mask $g$
of size $n_g$. We currently do the same per pair calculations $\sigma^2(I_L)$
and $\sigma^2(I_R)$ for the full and regional matching, so we will discount
these in our comparison. The term $\mu(I_L)\mu(I_R)$ over $(d_{max} - d_{min})$
disparities requires $n_x n_y n_g (d_{max} - d_{min})$ multiplications for the
full image case and $\sum_W n_{xi} n_{yi} n_g (d_{maxi} - d_{mini})$
multiplications for the set of extracted windows $W$, where $W_i$ has dimensions
$(n_{xi}, n_{yi})$. Similarly calculating $\mu(I_L I_R)$ will require
$n_x n_y n_g (d_{max} - d_{min})$ versus
$\sum_W n_{xi} n_{yi} n_g (d_{maxi} - d_{mini})$ multiplications. We have to
weigh this saving in the covariance calculation against the smoothing and least
squares calculation per window of the optical flow prediction process. 

For temporal estimates over $n_t$ images in a sequence we have to smooth and
calculate derivatives for the images in the sequence in x, y and t. We currently
do this calculation over the entire rectified image which requires
$(2 \times 3)\, n_g n_x n_y n_t$ multiplications for each of 2 (right and left)
image sequences. To solve $Av = b$, using for example QR decomposition and back
substitution requires approximately $12\, n_{xi} n_{yi}$ flops per window. 

Finally for the flood-fill segmentation each pixel may be visited up to 4 times 
(once when considering each of its neighbours), but probably much fewer. The 
only calculations performed are comparisons to update the window corners and 
disparity range as well as a running sum of the pixel values in each region. The 
cost is small compared to full image correlations, so we will disregard it here. 

The window based correspondence will be faster if the following comparison
is true:

$$2\, n_x n_y n_g (d_{max} - d_{min}) > (2 \times 2 \times 3)\, n_g n_x n_y n_t + 2 \sum_W \left[ n_{xi} n_{yi} n_g (d_{maxi} - d_{mini}) \right] + \sum_W \left[ 12\, n_{xi} n_{yi} \right]. \tag{2}$$

For the examples demonstrated in this paper the ratio of window based
calculations to full image calculations is about 0.55, which is a significant
saving. This saving is largely dependent on the number of frames in time $n_t$
which are used to estimate derivatives for the optical flow calculation. 
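
To make the comparison concrete, the operation counts given in this section can be tallied by a short script. The function names, the `(width, height, disparity_count)` window tuples and the example numbers below are hypothetical and chosen only for illustration; they are not the windows or the 0.55 ratio reported in the paper. The sketch also assumes the least squares term is counted once per window.

```python
def full_image_cost(nx, ny, ng, d_count):
    """Multiplications for the two per-disparity terms over the full image."""
    return 2 * nx * ny * ng * d_count

def window_cost(nx, ny, ng, nt, windows):
    """Operation count for the window-based scheme.

    windows -- iterable of (width, height, disparity_count) per window
    """
    # Smoothing and x, y, t derivatives over both full image sequences.
    derivatives = 2 * (2 * 3) * ng * nx * ny * nt
    # Two per-disparity correlation terms, restricted to each window.
    correlation = 2 * sum(w * h * ng * d for w, h, d in windows)
    # Approximate least-squares flow solve per window (assumed once each).
    flow_solve = sum(12 * w * h for w, h, _ in windows)
    return derivatives + correlation + flow_solve

# Hypothetical example: a 320 x 240 image, mask size 5, 100 disparities,
# 4 frames for the temporal derivatives and two extracted windows.
full = full_image_cost(320, 240, 5, 100)
windowed = window_cost(320, 240, 5, 4, [(100, 100, 14), (80, 120, 10)])
ratio = windowed / full
```

In this made-up configuration the derivative term dominates the window-based cost, which illustrates why the saving depends so strongly on the number of frames used for the flow estimate.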

3.2 Experiments 

Segmentation Thresholds. Having established this complexity relationship we 
can see that the number and size of window regions and their disparity ranges is 
critical to the amount of speedup achieved. These properties are to some extent 
controlled by the thresholds in the flood fill segmentation which determine when 
nearby windows can be merged based on their disparity range and proximity. 




230 J. Mulligan and K. Daniilidis 




Fig. 7. Variation in window properties as a result of varying the NEAR-DISP and
NEAR-WIN thresholds in the segmentation algorithm: (a) number of windows,
(b) window size, (c) disparity range. The x axis represents values (in pixels)
of the NEAR-DISP threshold; the 4 curves represent values of NEAR-WIN
(0, 5, 10, and 15 pixels).



For the purpose of demonstration we will consider the sequence of images 
in Figure 3. The full disparity calculation was performed for frame 12, and 
the resulting disparity map is illustrated in Figure 4 along with the windows 
extracted by the flood fill algorithm. The plots in Figure 7 illustrate the effect 
on the number of windows (a), their size in pixels (b) and the length of disparity 
range (c) associated with varying the thresholds for window merging. Initially 
the number of windows increases as small windows, which would otherwise be 
discarded, are merged such that together they are large enough to be retained. 
Eventually the number of windows levels off and slowly decreases, until in the 
limit one would have only one large window. The plots for threshold values versus 
window size in pixels and length of disparity range indicate that larger thresholds 
result in larger windows and larger disparity ranges, as we would expect. 

Another interesting question is just how small we can make the disparity
range, since decreasing the correspondence search space was our aim when we
started this work. Generally there will be a tradeoff between the size of the
disparity range and the number of windows we have to search in. The size and
number of the windows will also be affected by the NEAR-WIN threshold as
illustrated by the two curves in Figure 8. These plot region based complexity
as a proportion of the full frame correspondence calculation, with respect to
disparity bounds ranging from 1 to 15 pixels. When no neighbour merging is done,
the smaller the disparity range the lower the complexity. For NEAR-WIN=10
merging many relatively distant small windows with similar very narrow (1-4
pixel) disparity range increases window overlap, and steeply increases
complexity. However, this calculation is a little simplistic because it suggests
that a window with disparity range of 1 can simply be propagated forward in time
using our window flow value. As we will see below, pure prediction is a very
poor way of determining disparities for the next frame, and search is always
necessary to compensate for errors in our model.

Fig. 8. Window algorithm calculation as a proportion of the full frame
complexity for NEAR-WIN=0 and NEAR-WIN=10, for disparity bounds from 1 to 15.

Fig. 9. Predicted, region-based and full correspondence disparity maps (frame 18).

Fig. 10. Triangulations of depth points for predicted, region-based and full
correspondence (frame 18).



Accuracy of Disparity Estimates. A critical issue when introducing a compromise 
such as our windowing technique is how accurate the results will be. We have 
tested two methods for estimating disparities for the current time step: a pure 
prediction approach simply using the flow and disparity value for each window 
location to calculate the new disparity according to: 

$$d(t + 1) = d(t) + \Delta t\, v_{xl} - \Delta t\, v_{xr}.$$

Second is the approach which actually calculates the new disparity via
correlation on the predicted window location. The windows extracted for frame 12
are shown in Figure 4; the calculated disparity maps for the prediction, region
and full image methods for frame 18 in the sequence are illustrated in Figure 9
(brighter pixels indicate higher disparity values). Eliminating small windows in
the segmentation obviously acts as a filter, eliminating the background noise
seen in the full correlation. We can also examine the histograms of error in
disparity in Figure 11. These take the full image calculation as correct and
compare the prediction and region-based correlation to it
$(d_{full} - d_{est})$. Of course even the full image calculation may have
correspondence errors, but in the absence of ground truth we will accept it as
the best available estimate.

The pure prediction fares poorly with a mean error of 3.5 pixels, and a 
standard deviation of 17.9. The region-based correlation however has a mean of 
-0.4 pixels and a standard deviation of only 5.0 pixels. We can also examine the 
rendered versions of the data in Figure 10. These views are extracted from our 
3D viewer and show the triangulated reconstructions rotated by 30° in X and 
—30° in Y . The pure prediction reconstruction is clearly wrong. It is assigning 
depth values to regions that are essentially empty in the new image, apparently 
because of an underestimate of the optical flow for some windows, or because 
the flow values do not accurately reflect rotation in depth. 
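
For reference, the pure prediction rule and the error measure used in this comparison can be written as a short sketch. The function names are hypothetical, and marking unmatched pixels with NaN is an assumption of the sketch rather than the representation used in the system.

```python
import numpy as np

def predict_disparity(d, v_xl, v_xr, dt=1.0):
    """Pure prediction: shift disparity by the horizontal flow difference."""
    return d + dt * v_xl - dt * v_xr

def disparity_error_stats(d_full, d_est):
    """Mean and standard deviation of (d_full - d_est) where both exist."""
    d_full = np.asarray(d_full, dtype=float)
    d_est = np.asarray(d_est, dtype=float)
    both = np.isfinite(d_full) & np.isfinite(d_est)  # NaN marks "no match"
    err = d_full[both] - d_est[both]
    return err.mean(), err.std()
```

Only pixels matched by both methods enter the statistics, so points that one method misses are reflected in a separate unmatched percentage rather than in the error.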

The region-based method performs much better. It misses some points matched
by the full correspondence, also probably because of an underestimate of flow
values. However it fills in some holes visible in the full reconstruction,
probably because the full image method finds a stronger (but erroneous)
correlation, outside the disparity range of the windows which fall on this
region. The triangulation method used to render the reconstructions deletes
triangles with long legs, hence the holes left in the full frame triangulation. 

Fig. 11. Histogram of disparity error versus full correspondence.

Table 1. Sequence performance statistics for regional correspondence.

frame   % full calc   % unmatched   mean err in d   σ err in d
5       53.3          0.16          -0.49           3.50
9       67.7          0.18          -0.14           4.36
13      54.0          0.24          -0.58           3.62
17      54.1          0.28          -0.57           4.38


Extended Sequence. Our final experiment is to observe the effect of processing 
only extracted windows over several frames as proposed by our algorithm. Can 
the system actually ‘lock-on’ to objects defined by the segmentation over time, 
or will it lose them? We ran the algorithm on frames 1-17 (561 ms) of the image 
sequence. The full disparity map was calculated for the first frame as a starting 
point, and windows were extracted. Optical flow per window was computed over 
4 frame sequences (2-5, 6-9, 10-13, and 14-17), and the region-based disparities 
were calculated for frames 5, 9, 13 and 17. The images and their corresponding 
regional disparity maps are illustrated in Figures El and 1 1 .‘il Table [D further 
breaks down the performance over time. The mean and standard deviation of 
{dfuii — dest) do not increase significantly, but the percentage of points for which 
the full calculation finds a match but the regional method does not steadily 
increases (although relatively small at 0.28%). This is what one would expect, 
since only motion increases the window size over time, and once a portion of 
an object is ‘lost’ there is no guarantee it will eventually fall in a window with 
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Fig. 12. Left image sequence frames 5, 9, 13 and 17. 




Fig. 13. Region-based correspondence disparity maps (frames 5, 9, 13, and 17). 




Fig. 14. Triangulations of depth points region-based correspondence (frames 5, 9, 13 
and 17), views are rotated by 30° in X and —30° in Y. 



appropriate disparity range again. The percentage of the full image calculation 
based on Equation |21 is about 55%. 
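
The statistics reported in Table 1 can be computed along the following lines. This is a minimal sketch, not the authors' code: it assumes that both disparity maps are arrays of the same shape, that unmatched pixels are marked with NaN, and that the percentage of unmatched points is taken relative to the points matched by the full calculation. The function name `regional_stats` is hypothetical.

```python
import numpy as np

def regional_stats(d_full, d_est):
    """Compare a full disparity map with a region-based one.

    Assumption: NaN marks a pixel for which no match was found.
    Returns (% unmatched, mean of d_full - d_est, std of d_full - d_est).
    """
    d_full = np.asarray(d_full, dtype=float)
    d_est = np.asarray(d_est, dtype=float)
    full_ok = ~np.isnan(d_full)
    est_ok = ~np.isnan(d_est)
    both = full_ok & est_ok
    err = d_full[both] - d_est[both]
    # Points matched by the full calculation but missed by the regional method.
    missed = np.count_nonzero(full_ok & ~est_ok)
    unmatched = 100.0 * missed / np.count_nonzero(full_ok)
    return unmatched, err.mean(), err.std()
```

Tracking these three numbers frame by frame is one way to observe whether the windows are gradually losing parts of an object.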

4 Conclusions and Future Work 

Providing dense accurate 3D reconstructions for virtual or augmented reality 
systems, in real-time is a challenge for conventional correlation stereo. One way 
to meet this challenge and to deal with the combinatorics of the long disparity 
ranges required, is to exploit temporal coherence in binocular image sequences. 
This requires a tradeoff between the benefit from motion calculations integrated 
into the stereo system, and the added cost of making these calculations. 

In this paper we have proposed a simple method for decreasing the cost 
of dense stereo correspondence with the goal of reconstructing a person in an 
office environment in real-time for use in an augmented reality tele-immersion 
system. We start by segmenting an initial dense disparity map into overlapping 
rectangular windows using a flood fill algorithm which bounds regions by limiting 
the range of disparities they contain. We predict the new window locations and 
disparity ranges in an image sequence via a single optical flow estimate per 
window. Future reconstructions are based on correspondence in the predicted 
windows over the predicted disparities. 
We have examined the relative complexity of stereo correspondence on full
images versus our proposed window system and found that depending on the 
number of frames in time used to estimate optical flow the window-based system 
requires about half the time of standard correlation stereo. We have demonstra- 
ted experimentally that our window-based reconstructions compare favourably 
to those generated by the full algorithm even after several frames of propagation 
via estimated optical flow. The observed mean differences in computed dispari- 
ties were less than 1 pixel and the maximum standard deviation was 4.4 pixels. 

Obviously there is much more we can exploit in the stereo-flow relationship. 
We plan to examine probabilistic fusion over time, which will allow us to associate 
window regions with similar motions and treat them as a single object. This 3D 
segmentation can then guide our meshing and rendering as well as improve the 
predictions for position and disparity range for our algorithm. In order to catch 
any new objects entering the scene we have to track changes on the boundary;
the simplest way to achieve this is to maintain image differences to detect any
new motion, and expand existing windows or generate new ones with a large 
disparity range. 

For our specific application in tele-immersion we plan to expand our methods 
to a polynocular stereo configuration as illustrated in Figure Dl The problem 
of combining and verifying correspondence from multiple camera pairs can be 
restricted in a similar way by projecting window volumes in $(x, y, d)$ from a
master pair into the images of other pairs. These windows could be tracked in 
the auxiliary pairs in much the same way we have proposed here, providing 
additional constraint and evidence for our reconstructed models. 
Abstract. In this paper we consider the problem of estimating the fundamental
matrix from point correspondences. It is well known that the
most accurate estimates of this matrix are obtained by criteria mini- 
mizing geometric errors when the data are affected by noise. It is also 
well known that these criteria amount to solving non-convex optimiza- 
tion problems and, hence, their solution is affected by the optimization 
starting point. Generally, the starting point is chosen as the fundamental 
matrix estimated by a linear criterion but this estimate can be very inac- 
curate and, therefore, inadequate to initialize methods with other error 
criteria. 

Here we present a method for obtaining a more accurate estimate of the 
fundamental matrix with respect to the linear criterion. It consists of 
the minimization of the algebraic error taking into account the rank 2 
constraint of the matrix. Our aim is twofold. First, we show how this non- 
convex optimization problem can be solved avoiding local minima using 
recently developed convexification techniques. Second, we show that the 
estimate of the fundamental matrix obtained using our method is more 
accurate than the one obtained from the linear criterion, where the rank 
constraint of the matrix is imposed after its computation by setting the 
smallest singular value to zero. This suggests that our estimate can be 
used to initialize non-linear criteria such as the distance to epipolar lines 
and the gradient criterion, in order to obtain a more accurate estimate 
of the fundamental matrix. As a measure of the accuracy, the obtained 
estimates of the epipolar geometry are compared in experiments with 
synthetic and real data. 



1 Introduction 

The computation of the fundamental matrix existing between two views of the 
same scene is a very common task in several applications in computer vision, 
including calibration, reconstruction, visual navigation and visual
servoing. The importance of the fundamental matrix is due to the fact that it 
represents succinctly the epipolar geometry of stereo vision. Indeed, its
knowledge provides relationships between corresponding points in the two images.
Moreover, for known intrinsic camera parameters, it is possible to recover the
essential matrix from the fundamental matrix and, hence, the camera motion,
that is the rotation and translation of the camera between views.

In this paper we consider the problem of estimating the fundamental matrix
from point correspondences. Several techniques have been developed,
like the linear criterion, the distance to epipolar lines criterion and the
gradient criterion. The first one is a least-squares technique minimizing the
algebraic error. This approach has proven to be very sensitive to image noise
and unable to express the rank constraint. The other two techniques take into
account the rank constraint and minimize a more indicative distance, the
geometric error, in the 7 degrees of freedom of the fundamental matrix. This results
in non-convex optimization problems that present local solutions in addition
to the global ones. Hence the solution found is affected by the choice of the
starting point of the minimization algorithm. Generally, this point is chosen
as the estimate provided by the linear criterion, forced to be singular by setting
the smallest singular value to zero, but this choice does not guarantee that
the global minimum is found.

In this paper we present a new method for the estimation of the fundamental 
matrix. It consists of a constrained least-squares technique where the rank con- 
dition of the matrix is ensured by the constraint. In this way we impose the 
singularity of the matrix a priori instead of forcing it after the minimization 
procedure as in the linear criterion. Our aim is twofold. First, we show how this 
optimization problem can be solved avoiding local minima. Second, we provide 
experimental results showing that our approach leads to a more accurate estimate
of the fundamental matrix. In order to find the solution while avoiding local
minima, we proceed as follows. First, we show how this problem can be addressed
as the minimization of a rational function in two variables. This function is
a non-convex one and, therefore, the optimization problem still presents local 
minima. Second, we show how this minimization can be reformulated so that 
it can be tackled by recently developed convexification techniques |2|. In this 
manner, local optimal solutions are avoided and only the global one is found. 
The same problem has been studied by Hartley |0|, who provided a method for 
minimizing the algebraic error ensuring the rank constraint, which requires an 
optimization over two free parameters (position of an epipole) . However, the op- 
timization stage on these two unknowns is not free of local minima in the general 
case. 

The paper is organized as follows. In section 2, we give some preliminaries ab- 
out the fundamental matrix and the estimation techniques mentioned above. In 
section 3, we state our approach to the problem, showing how the constrained 
least-squares minimization in the unknown entries of the fundamental matrix 
can be cast as a minimization in only two unknowns. Section 4 shows how this 
optimization problem can be solved using convexification methods in order to 
find the global minima. In section 5 we present some results obtained with our 
approach using synthetic and real data, and we provide comparisons with other 
methods. In particular, we show that our solution gives smaller geometric errors
than the one provided by the linear criterion. Moreover, initializing non-linear 
criteria with our solution allows us to find a more accurate estimate of the fun- 
damental matrix. Finally, in section 6 we conclude the paper. 

2 Preliminaries 

First of all, let us introduce the notation used in this paper. 

$\mathbb{R}$: real space;
$I_n$: $n \times n$ identity matrix;
$A^T$: transpose of $A$;
$A > 0$ ($A \ge 0$): positive definite (semidefinite) matrix;
$(A)_{i,j}$: entry $(i,j)$ of $A$;
$\|u\|_2$ ($\|u\|_{2,W}$): (weighted) euclidean norm of $u$;
$\det(A)$: determinant of $A$;
$\mathrm{adj}(A)$: adjoint of $A$;
$\lambda_M(A)$: maximum real eigenvalue of $A$;
$\mathrm{Ker}(A)$: null space of $A$.

Given a pair of images, the fundamental matrix $F \in \mathbb{R}^{3 \times 3}$ is defined as the
matrix satisfying the relation

\[ u'^T F u = 0 \quad \forall\, u', u \qquad (1) \]

where $u', u \in \mathbb{R}^3$ are the projections expressed in homogeneous coordinates of
the same 3D point in the two images. The fundamental matrix $F$ has 7 degrees
of freedom, being defined up to a scale factor and being singular.

The linear criterion for the estimation of $F$ is defined as

\[ \min_F \sum_{i=1}^{n} (u_i'^T F u_i)^2 \qquad (2) \]

where $n$ is the number of observed point correspondences. In order to obtain a
singular matrix, the smallest singular value of the found estimate is set to zero.
The distance to epipolar lines criterion and the gradient criterion take into
account the rank constraint using a suitable parameterization for $F$. The first
criterion defines the cost function as the sum of squares of distances of a point
to the corresponding epipolar line. The second criterion considers a problem of
surface fitting between the data $[(u'_i)_1; (u'_i)_2; (u_i)_1; (u_i)_2] \in \mathbb{R}^4$ and the surface
defined by (1). These non-linear criteria result in the minimization of weighted
least-squares:

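\[ \min_{F:\ \det(F)=0}\ \sum_{i=1}^{n} w(F, u'_i, u_i)\,(u_i'^T F u_i)^2 \qquad (3) \]

where

\[ w(F, u'_i, u_i) = \frac{1}{(F^T u'_i)_1^2 + (F^T u'_i)_2^2} + \frac{1}{(F u_i)_1^2 + (F u_i)_2^2} \qquad (4) \]

for the distance to epipolar lines criterion and

\[ w(F, u'_i, u_i) = \frac{1}{(F^T u'_i)_1^2 + (F^T u'_i)_2^2 + (F u_i)_1^2 + (F u_i)_2^2} \qquad (5) \]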



for the gradient criterion. The main problem with these non-linear criteria is
the dependency of the found solution on the starting point for the optimization
procedure, due to the fact that the cost function defined in (3) is non-convex.
Experiments show a large difference between results obtained starting from the
exact solution and starting from the solution provided by the linear criterion,
generally used to initialize these minimizations.
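
As an illustration of the criteria reviewed above, the following sketch computes the linear estimate with the rank constraint imposed a posteriori, together with the residuals $u_i'^T F u_i$ and the two weights used by the non-linear criteria. It rests on assumptions not stated in the text: the points are given as rows of homogeneous coordinates, the entries of $F$ are stacked column by column, and the linear criterion is normalized with a unit-norm constraint on $F$. The function names are hypothetical.

```python
import numpy as np

def linear_criterion(up, u):
    """Linear estimate of F from (n, 3) arrays of homogeneous points u'_i, u_i.

    Assumption: ||f|| = 1 is used to exclude the trivial solution F = 0.
    """
    # Row i holds the coefficients of the entries of F (column-major order)
    # in the bilinear form u'_i^T F u_i.
    M = np.stack([np.kron(ui, upi) for upi, ui in zip(up, u)])
    # Right singular vector associated with the smallest singular value.
    f = np.linalg.svd(M)[2][-1]
    F = f.reshape(3, 3, order="F")
    # Rank 2 is imposed after the minimization: zero the smallest singular value.
    U, s, Vt = np.linalg.svd(F)
    s[2] = 0.0
    return U @ np.diag(s) @ Vt

def residuals_and_weights(F, up, u):
    """Algebraic residuals and the weights of the two non-linear criteria."""
    r = np.einsum("ij,jk,ik->i", up, F, u)   # u'_i^T F u_i
    l = u @ F.T                              # rows are F u_i
    lp = up @ F                              # rows are F^T u'_i
    a = lp[:, 0] ** 2 + lp[:, 1] ** 2
    b = l[:, 0] ** 2 + l[:, 1] ** 2
    w_dist = 1.0 / a + 1.0 / b               # distance to epipolar lines
    w_grad = 1.0 / (a + b)                   # gradient criterion
    return r, w_dist, w_grad
```

With these quantities, the cost minimized by either non-linear criterion is `np.sum(w * r**2)` for the corresponding weight `w`.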

Before proceeding with the presentation of our approach, let us review the
method proposed by Hartley for the minimization of the algebraic error constrained
by the singularity of the fundamental matrix. In short, Hartley reduces the
number of degrees of freedom from eight (the free parameters of the fundamental
matrix) to two (the free parameters of an epipole) under the rank 2 constraint.
Therefore, the optimization stage looks for the epipole minimizing the algebraic
error. Unfortunately, this step is not free of local minima in the general case, as
figure 1 shows for the statue image sequence.



3 The Constrained Least-Squares Estimation Problem 

Fig. 1. Algebraic error versus position of epipole.

The problem that we wish to solve can be written as

\[ \min_F \sum_{i=1}^{n} w_i\,(u_i'^T F u_i)^2 \quad \text{subject to } \det(F) = 0 \qquad (6) \]
where $n \ge 8$, $w_i \in \mathbb{R}$ are given positive weighting coefficients and the constraint
ensures the rank condition of the estimate of $F$. Let us introduce $A \in \mathbb{R}^{n \times 8}$,
$\mathrm{rank}(A) = 8$, and $b \in \mathbb{R}^n$ such that

\[ \sum_{i=1}^{n} w_i\,(u_i'^T F u_i)^2 = \|Af - b\|_{2,W}^2 \qquad (7) \]

where $f \in \mathbb{R}^8$ contains the entries of $F$:

\[ F = \begin{pmatrix} (f)_1 & (f)_4 & (f)_7 \\ (f)_2 & (f)_5 & (f)_8 \\ (f)_3 & (f)_6 & 1 \end{pmatrix} \qquad (8) \]



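and $W \in \mathbb{R}^{n \times n}$ is the diagonal matrix with entries $w_i$. Since $F$ is defined up to
a scale factor, we have set $(F)_{3,3} = 1$ (see Remark 1). Then, (6) can be written
as

\[ \min_{\lambda, f} \|Af - b\|_{2,W}^2 \quad \text{subject to } T(\lambda) f = r(\lambda) \qquad (9) \]

where $\lambda \in \mathbb{R}^2$, $T(\lambda) \in \mathbb{R}^{3 \times 8}$ and $r(\lambda) \in \mathbb{R}^3$ are so defined:

\[ T(\lambda) = \begin{pmatrix} I_3 & (\lambda)_1 I_3 & \begin{matrix} (\lambda)_2 & 0 \\ 0 & (\lambda)_2 \\ 0 & 0 \end{matrix} \end{pmatrix} \qquad (10) \]

\[ r(\lambda) = (0 \;\; 0 \;\; -(\lambda)_2)^T \qquad (11) \]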



The constraint in (9) expresses the singularity condition on $F$ as the linear
dependency of the columns of $F$ and hence $\det(F) = 0$. In order to solve the
minimization problem let us observe that the constraint is linear in $f$ for any
fixed $\lambda$. So, the problem can be solved using Lagrange multipliers, obtaining
the solution $f^*(\lambda)$:



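\[ f^*(\lambda) = v - S\,T^T(\lambda) P(\lambda) \left[ T(\lambda) v - r(\lambda) \right] \qquad (12) \]

where $v \in \mathbb{R}^8$, $S \in \mathbb{R}^{8 \times 8}$ and $P(\lambda) \in \mathbb{R}^{3 \times 3}$ are:

\[ v = S A^T W b, \qquad (13) \]
\[ S = (A^T W A)^{-1}, \qquad (14) \]
\[ P(\lambda) = \left[ T(\lambda) S\,T^T(\lambda) \right]^{-1}. \qquad (15) \]

Now, it is clear that the minimum $J^*$ of (9) can be computed as:

\[ J^* = \min_{\lambda} J^*(\lambda). \qquad (16) \]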
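
The closed-form solution for a fixed $\lambda$ can be sketched numerically as follows. The sketch assumes that the points are given as rows of homogeneous coordinates, that $(F)_{3,3} = 1$, and that the constraint $T(\lambda) f = r(\lambda)$ encodes the column dependency $c_1 + (\lambda)_1 c_2 + (\lambda)_2 c_3 = 0$, where $c_1, c_2, c_3$ are the columns of $F$. All function names are hypothetical.

```python
import numpy as np

def build_A_b(up, u, w=None):
    """A, b, W such that sum_i w_i (u'_i^T F u_i)^2 = ||A f - b||^2_{2,W}."""
    K = np.stack([np.kron(ui, upi) for upi, ui in zip(up, u)])
    # The last column multiplies the fixed entry (F)_{3,3} = 1.
    A, b = K[:, :8], -K[:, 8]
    W = np.diag(np.ones(len(A)) if w is None else np.asarray(w, dtype=float))
    return A, b, W

def T_r(lam):
    """Constraint matrix and right-hand side for a given lambda (assumed form)."""
    l1, l2 = lam
    T = np.hstack([np.eye(3), l1 * np.eye(3), l2 * np.eye(3)[:, :2]])
    r = np.array([0.0, 0.0, -l2])
    return T, r

def f_star(lam, A, b, W):
    """Constrained least-squares solution f*(lambda) and its cost J*(lambda)."""
    S = np.linalg.inv(A.T @ W @ A)
    v = S @ A.T @ W @ b                    # unconstrained solution
    T, r = T_r(lam)
    P = np.linalg.inv(T @ S @ T.T)
    f = v - S @ T.T @ P @ (T @ v - r)      # correction enforcing T f = r
    e = A @ f - b
    return f, e @ W @ e

def to_F(f):
    """Rebuild the 3x3 matrix from f, with (F)_{3,3} = 1."""
    return np.append(f, 1.0).reshape(3, 3, order="F")
```

For any $\lambda$ the resulting matrix is singular by construction, so the remaining work is the two-dimensional search over $\lambda$.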



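So, let us calculate $J^*(\lambda)$. Substituting $f^*(\lambda)$ into the cost function we obtain:

\[ J^*(\lambda) = \|A f^*(\lambda) - b\|_{2,W}^2 = c_0 + \sum_{i=1}^{3} c_i(\lambda) \qquad (17) \]

where

\[ c_0 = b^T W \left( I_n - A S A^T W \right) b, \qquad (18) \]
\[ c_1(\lambda) = r^T(\lambda) P(\lambda) r(\lambda), \qquad (19) \]
\[ c_2(\lambda) = -2 v^T T^T(\lambda) P(\lambda) r(\lambda), \qquad (20) \]
\[ c_3(\lambda) = v^T T^T(\lambda) P(\lambda) T(\lambda) v. \qquad (21) \]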

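The constrained problem (9) in 8 variables is equivalent to the unconstrained
problem (16) in 2 variables. In order to compute the solution $J^*$, let us consider
the form of the function $J^*(\lambda)$. The terms $c_i(\lambda)$ are rational functions of the
entries of $\lambda$. In fact, let us write $P(\lambda)$ as:

\[ P(\lambda) = \frac{G(\lambda)}{d(\lambda)}, \qquad (22) \]
\[ G(\lambda) = \mathrm{adj} \left[ T(\lambda) S\,T^T(\lambda) \right], \qquad (23) \]
\[ d(\lambda) = \det \left[ T(\lambda) S\,T^T(\lambda) \right]. \qquad (24) \]

Since $T(\lambda)$ depends linearly on $\lambda$, we have that $G(\lambda)$ is a polynomial matrix of
degree 4 and $d(\lambda)$ a polynomial of degree 6. Straightforward computations allow
us to show that $J^*(\lambda)$ can be written as:

\[ J^*(\lambda) = \frac{h(\lambda)}{d(\lambda)} \qquad (25) \]

where $h(\lambda)$ is a polynomial of degree 6 defined as:

\[ h(\lambda) = c_0 d(\lambda) + r^T(\lambda) G(\lambda) r(\lambda) - 2 v^T T^T(\lambda) G(\lambda) r(\lambda) + v^T T^T(\lambda) G(\lambda) T(\lambda) v. \qquad (26) \]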

Let us observe that the function $d(\lambda)$ is strictly positive everywhere, being the
determinant of the positive definite matrix $T(\lambda) S\,T^T(\lambda)$.
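
The rational form of the cost can be checked numerically. The sketch below evaluates $h(\lambda)$ and $d(\lambda)$ through the adjoint and the determinant of $T(\lambda) S\,T^T(\lambda)$; it assumes the same block structure of $T(\lambda)$ as in the constraint $c_1 + (\lambda)_1 c_2 + (\lambda)_2 c_3 = 0$ on the columns of $F$, and it takes $S$, $v$ and $c_0$ as given inputs. The function names are hypothetical.

```python
import numpy as np

def adjugate(M):
    # For an invertible matrix, adj(M) = det(M) * inv(M).
    return np.linalg.det(M) * np.linalg.inv(M)

def h_and_d(lam, S, v, c0):
    """Numerator h(lambda) and denominator d(lambda) of J*(lambda)."""
    l1, l2 = lam
    T = np.hstack([np.eye(3), l1 * np.eye(3), l2 * np.eye(3)[:, :2]])
    r = np.array([0.0, 0.0, -l2])
    M = T @ S @ T.T
    G, d = adjugate(M), np.linalg.det(M)
    e = T @ v - r
    # Expanding e^T G e gives the three quadratic terms of h(lambda).
    h = c0 * d + e @ G @ e
    return h, d
```

Since $S$ is positive definite and $T(\lambda)$ has full row rank, `d` is positive for every `lam`, so the ratio `h / d` is always well defined.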

Remark 1 The parametrization of the fundamental matrix chosen in (8) is
not general since $(F)_{3,3}$ can be zero. Hence, in this step we have to select one
of the nine entries of F that, for the considered case, is not zero and not too 
small in order to avoid numerical problems. A good choice could be setting to 
1 the entry with the maximum modulus in the estimate provided by the linear 
criterion. 



4 Problem Solution via Convex Programming 

In this section we present a convexification approach to the solution of problem
(6). The technique is based on Linear Matrix Inequalities (LMI) and leads
to the construction of lower bounds on the global solution of the polynomial 
optimization problem OHl). More importantly, it provides an easy test to check 
whether the obtained solution is the global optimum or just a lower bound. From
the previous section we have that:

\[ J^* = \min_{\lambda} \frac{h(\lambda)}{d(\lambda)}. \qquad (27) \]

Let us rewrite (27) as:

\[ J^* = \min_{\lambda, \delta} \delta \quad \text{subject to } \frac{h(\lambda)}{d(\lambda)} = \delta \qquad (28) \]



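where $\delta \in \mathbb{R}$ is an additional variable. The constraint in (28) can be written as
$y(\lambda, \delta) = 0$ where

\[ y(\lambda, \delta) = h(\lambda) - \delta\, d(\lambda) \qquad (29) \]

since $d(\lambda) \ne 0$ for all $\lambda$. Hence, $J^*$ is given by:

\[ J^* = \min_{\lambda, \delta} \delta \quad \text{subject to } y(\lambda, \delta) = 0 \qquad (30) \]

where the constraint is a polynomial in the unknowns $\lambda$ and $\delta$.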

Problem (30) belongs to a class of optimization problems for which convexification
techniques have been recently developed. The key idea behind this
technique is to embed a non-convex problem into a one-parameter family of convex
optimization problems. Let us see how this technique can be applied to our
case. First of all, let us rewrite the polynomials $h(\lambda)$ and $d(\lambda)$ as follows:



\[ h(\lambda) = \sum_{i=0}^{6} h_i(\lambda), \qquad (31) \]
\[ d(\lambda) = \sum_{i=0}^{6} d_i(\lambda), \qquad (32) \]

where $h_i(\lambda)$ and $d_i(\lambda)$ are homogeneous forms of degree $i$. Now, let us introduce
the function $y(c; \lambda, \delta)$:

\[ y(c; \lambda, \delta) = \sum_{i=0}^{6} \frac{\delta^{6-i}}{c^{6-i}} \left[ h_i(\lambda) - c\, d_i(\lambda) \right]. \qquad (33) \]

We have the following properties:

1. for a fixed $c$, $y(c; \lambda, \delta)$ is a homogeneous form of degree 6 in $\lambda$ and $\delta$;

2. $y(\lambda, \delta) = y(c; \lambda, \delta)$ for all $\lambda$ if $\delta = c$.

Hence the form $y(c; \lambda, \delta)$ and the polynomial $y(\lambda, \delta)$ are equal on the plane $\delta = c$.
In order to find $J^*$ let us observe that $\delta > 0$ because $J^*$ is positive. Moreover,
since $h(\lambda)$ is positive, then $y(\lambda, \delta) > 0$ for $\delta = 0$. This suggests that $J^*$ can be
computed as the minimum $\delta$ for which the function $y(\lambda, \delta)$ loses its positivity,
that is:

\[ J^* = \min \{ \delta : y(\lambda, \delta) \le 0 \text{ for some } \lambda \}. \qquad (34) \]
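
The two properties of the homogeneous form can be verified on an example. In the sketch below a polynomial in $(\lambda)_1, (\lambda)_2$ is represented, as an assumption made only for illustration, by a dictionary mapping exponent pairs to coefficients; the helper names are hypothetical.

```python
def poly_eval(p, lam):
    """Evaluate a polynomial stored as {(a, b): coeff} at lam = (l1, l2)."""
    return sum(c * lam[0] ** a * lam[1] ** b for (a, b), c in p.items())

def degree_part(p, i):
    """Homogeneous part of degree i."""
    return {m: c for m, c in p.items() if sum(m) == i}

def y_poly(h, d, lam, delta):
    """The polynomial y(lambda, delta) = h(lambda) - delta * d(lambda)."""
    return poly_eval(h, lam) - delta * poly_eval(d, lam)

def y_form(h, d, c, lam, delta):
    """The homogeneous form y(c; lambda, delta) of degree 6."""
    total = 0.0
    for i in range(7):
        hi = poly_eval(degree_part(h, i), lam)
        di = poly_eval(degree_part(d, i), lam)
        total += (delta / c) ** (6 - i) * (hi - c * di)
    return total
```

Evaluating `y_form` at a scaled pair `(t * lam, t * delta)` multiplies its value by `t ** 6`, and on the plane `delta == c` it coincides with `y_poly`.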

Hence, using the homogeneous form $y(c; \lambda, \delta)$, equation (34) becomes:

\[ J^* = \min \{ c : y(c; \lambda, \delta) \le 0 \text{ for some } \lambda, \delta \}. \qquad (35) \]

The difference between (35) and (34) is the use of a homogeneous form, $y(c; \lambda, \delta)$,
instead of a polynomial, $y(\lambda, \delta)$. Now, let us observe that $y(c; \lambda, \delta)$ can be written
as:



\[ y(c; \lambda, \delta) = z^T(\lambda, \delta)\, Y(c)\, z(\lambda, \delta) \qquad (36) \]

where $z(\lambda, \delta) \in \mathbb{R}^{10}$ is a base vector for the forms of degree 3 in the variables
$(\lambda)_1$, $(\lambda)_2$, $\delta$:

\[ z(\lambda, \delta) = \left( (\lambda)_1^3 \;\; (\lambda)_1^2 (\lambda)_2 \;\; (\lambda)_1^2 \delta \;\; (\lambda)_1 (\lambda)_2^2 \;\; (\lambda)_1 (\lambda)_2 \delta \;\; (\lambda)_1 \delta^2 \;\; (\lambda)_2^3 \;\; (\lambda)_2^2 \delta \;\; (\lambda)_2 \delta^2 \;\; \delta^3 \right)^T \qquad (37) \]

and $Y(c) \in \mathbb{R}^{10 \times 10}$ is a symmetric matrix depending on $c$. Now, it is evident
that positivity of the matrix $Y(c)$ ensures positivity of the homogeneous form
$y(c; \lambda, \delta)$ (see (36)). Therefore, a lower bound $c^*$ of $J^*$ in (35) can be obtained by
looking at the loss of positivity of $Y(c)$. To proceed, we observe that this matrix
is not unique. In fact, for a given homogeneous form there is an infinite number
of matrices that describe it for the same vector $z(\lambda, \delta)$. So, we have to consider
all these matrices in order to check the positivity of $y(c; \lambda, \delta)$. It is easy to show
that all the symmetric matrices describing the form $y(c; \lambda, \delta)$ can be written as:



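\[ Y(c) - L, \quad L \in \mathcal{L} \qquad (38) \]

where $\mathcal{L}$ is the linear set of symmetric matrices that describe the null form:

\[ \mathcal{L} = \left\{ L = L^T \in \mathbb{R}^{10 \times 10} : z^T(\lambda, \delta)\, L\, z(\lambda, \delta) = 0 \;\; \forall \lambda, \delta \right\}. \qquad (39) \]

Since $\mathcal{L}$ is a linear set, every element $L$ can be linearly parametrized. Indeed, let
$L(\alpha)$ be a generic element of $\mathcal{L}$. It can be shown that $\mathcal{L}$ has dimension 27 and
hence

\[ L(\alpha) = \sum_{i=1}^{27} \alpha_i L_i \qquad (40) \]

for a given base $L_1, L_2, \ldots, L_{27}$ of $\mathcal{L}$. Hence, (38) can be written as:

\[ Y(c) - L(\alpha), \quad \alpha \in \mathbb{R}^{27}. \qquad (41) \]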
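
The dimension of the set of matrices describing the null form can be confirmed by a counting argument: a symmetric $10 \times 10$ matrix has 55 free entries, a form of degree 6 in three variables has 28 coefficients, and the map from the former to the latter is onto. The following sketch, with hypothetical function names, builds that linear map explicitly and computes the dimension of its null space.

```python
import itertools
import numpy as np

def monomials(degree, nvars=3):
    """Exponent tuples of all monomials of the given total degree."""
    return [e for e in itertools.product(range(degree + 1), repeat=nvars)
            if sum(e) == degree]

def null_form_dimension():
    z = monomials(3)                                       # 10 monomials of degree 3
    target = {m: k for k, m in enumerate(monomials(6))}    # 28 monomials of degree 6
    pairs = [(i, j) for i in range(len(z)) for j in range(i, len(z))]  # 55 entries
    # Column (i, j) holds the contribution of the entry L_ij to z^T L z.
    Phi = np.zeros((len(target), len(pairs)))
    for col, (i, j) in enumerate(pairs):
        m = tuple(a + b for a, b in zip(z[i], z[j]))
        Phi[target[m], col] = 1.0 if i == j else 2.0
    return len(pairs) - np.linalg.matrix_rank(Phi)
```

A basis of the null space of `Phi` would give one possible choice of the matrices $L_1, \ldots, L_{27}$.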



Summing up, a lower bound $c^*$ of $J^*$ can be obtained as:

\[ c^* = \min_{c, \alpha} c \quad \text{subject to } \min_{\alpha} \lambda_M \left[ L(\alpha) - Y(c) \right] \ge 0. \qquad (42) \]



244 G. Chesi et al. 



This means that c* can be computed via a sequence of convex optimizations 
indexed by the parameter c. Indeed, for a fixed c, the minimization of the ma- 
ximum eigenvalue of a matrix parametrized linearly in its entries is a convex 
optimization problem that can be solved with standard LMI techniques HM]. 
Moreover, a bisection algorithm on the scalar c can be employed to speed up the 
convergence. 
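
The bisection on the scalar $c$ can be sketched as follows. The LMI problem itself is not implemented here: `loses_positivity` is a hypothetical oracle, assumed to return True exactly when the convex problem solved for the given $c$ reports a loss of positivity, and assumed to be monotone in $c$.

```python
def bisect_lower_bound(loses_positivity, c_lo, c_hi, tol=1e-6):
    """Smallest c in [c_lo, c_hi] for which the oracle reports loss of positivity.

    Assumptions: the oracle is False at c_lo, True at c_hi, and switches once.
    """
    while c_hi - c_lo > tol:
        mid = 0.5 * (c_lo + c_hi)
        if loses_positivity(mid):
            c_hi = mid      # positivity already lost: the bound is at or below mid
        else:
            c_lo = mid      # still positive: the bound is above mid
    return c_hi
```

Each call to the oracle corresponds to one convex optimization, so the number of LMI problems to solve grows only logarithmically with the required accuracy.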

It remains to discuss when the bound $c^*$ is indeed equal to the sought optimal $J^*$.
It is obvious that this happens if and only if $y(c^*; \lambda, \delta)$ is positive semidefinite,
i.e. there exists $\lambda^*$ such that $y(c^*; \lambda^*, c^*) = 0$. In order to check this condition,
a very simple test is proposed. Let $\mathcal{K}$ be defined as:

\[ \mathcal{K} = \mathrm{Ker} \left[ L(\alpha^*) - Y(c^*) \right] \qquad (43) \]

where $\alpha^*$ is the minimizing $\alpha$ for the constraint in (42). Then, $J^* = c^*$ if and
only if there exists $\lambda^*$ such that $z(\lambda^*, c^*) \in \mathcal{K}$. It is possible to show that, except
for degenerate cases when $\dim(\mathcal{K}) > 1$, the last condition amounts to solving a
very simple system in the unknown $\lambda^*$. In fact, when $\mathcal{K}$ is generated by only one
vector $k$, then $\lambda^*$ is given by the equation:

\[ z(\lambda^*, c^*) = \frac{c^{*3}}{(k)_{10}}\, k. \qquad (44) \]

In order to solve the above equation, it is sufficient to observe that if (44) admits
a solution $\lambda^*$ then:

\[ (\lambda^*)_1 = c^* \frac{(k)_6}{(k)_{10}}, \qquad (\lambda^*)_2 = c^* \frac{(k)_9}{(k)_{10}}. \qquad (45) \]



Now, we have just to verify if $\lambda^*$ given by (45) satisfies (44). If it does then $c^*$
is optimal and the fundamental matrix entries $f^*$, solution of (9), are given by:

\[ f^* = f^*(\lambda^*) = v - S\,T^T(\lambda^*) P(\lambda^*) \left[ T(\lambda^*) v - r(\lambda^*) \right]. \qquad (46) \]



Whenever $c^*$ is not optimal, standard optimization procedures starting from
the value of $\lambda$ given by (45) can be employed for computing $J^*$. This is expected
to prevent the achievement of local minima. However, in our experiments we did
not experience any case in which $c^*$ is strictly less than $J^*$.
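
The optimality test can be written in a few lines. The sketch assumes that a vector $k$ spanning the one-dimensional kernel is already available from the LMI solution, and that the base vector $z$ uses the lexicographic monomial ordering in which $(\lambda)_1 \delta^2$, $(\lambda)_2 \delta^2$ and $\delta^3$ are the 6th, 9th and 10th entries. The function names are hypothetical.

```python
import numpy as np

def z_vec(lam, delta):
    """Base vector of the degree-3 monomials in (l1, l2, delta)."""
    l1, l2 = lam
    return np.array([l1**3, l1**2 * l2, l1**2 * delta, l1 * l2**2,
                     l1 * l2 * delta, l1 * delta**2, l2**3,
                     l2**2 * delta, l2 * delta**2, delta**3])

def recover_lambda(k, c_star, tol=1e-8):
    """Candidate lambda* from the kernel vector k, and the optimality flag."""
    k = np.asarray(k, dtype=float)
    # Entries 6, 9 and 10 of k (0-based indices 5, 8 and 9).
    lam = c_star * np.array([k[5], k[8]]) / k[9]
    # c* is optimal only if the rescaled k is a valid monomial vector.
    ok = np.allclose(z_vec(lam, c_star), c_star**3 / k[9] * k, atol=tol)
    return lam, ok
```

When the flag is False, the returned `lam` can still serve as a starting point for a standard local optimization of $J^*(\lambda)$.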



Remark 2 In order to avoid numerical problems due to too small values of
the parameter $c$ in (35), the procedure described above can be implemented
replacing $\delta$ in (28) by $\delta - 1$. This change of variable ensures $c \ge 1$.



5 Experiments and Results 

In this section we present some results obtained by applying our method for 
solving problem OOl. The goal is to investigate its performance with respect to 
the linear criterion. To evaluate the algorithm, we generated image data from
different 3D point sets and with different camera motions, and also applied the 
algorithm to real image data from standard sequences. In both cases, we scaled 
the image data in order to work with normalized data. 

In the sequel, we will refer to the estimate of the fundamental matrix given by
the linear criterion with $F_l$; to the estimate provided by our method, the constrained
least-squares criterion, with $F_{cls}$; and to the estimate provided by the distance
to epipolar lines criterion with $\bar{F}_d$ when initialized by $F_l$ and with $F_d$ when
initialized by $F_{cls}$. The algorithm we use to compute $F_{cls}$ is summarized below.



Algorithm for computing $F_{cls}$

1. Given the point correspondences $u'_i, u_i$, form the polynomials $h(\lambda)$
and $d(\lambda)$ as shown, respectively, in (26) and (24).

2. Build a symmetric matrix function $Y(c)$ satisfying (36).

3. Solve the sequence of LMI problems (42).

4. Retrieve $\lambda^*$ as shown in (45) and check for its optimality.

5. Retrieve $f^*$ as shown in (46) and form $F_{cls}$.



First, we report the results obtained with synthetic data. The points in 3D space 
have been generated randomly inside a cube of size 40cm and located 80cm from 
the camera centre. The focal length is 1000 pixels. The first image is obtained 
by projecting these points onto a fixed plane. The translation vector t and the 
rotational matrix R are then generated randomly, obtaining the projection ma- 
trix P of the second camera and, from this, the second image of the points. The 
camera calibration matrix is the same in all the experiments. In order to use 
corrupted data, Gaussian noise has been added to the image point coordinates 
(in the following experiments we refer to image noise as the standard deviation 
of the normal distribution). These experiments have been repeated fifty times 
and the mean values computed. The weighting coefficients $w_i$ in (6) have been
set to 1.
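
A synthetic test pair of this kind can be generated along the following lines. Several details are assumptions, since the text does not specify them: the cube is centred on the optical axis at 80 cm, the principal point is at the image origin, the rotation angle is limited to 0.3 rad, and the translation components are drawn uniformly in [-10, 10] cm. The function names are hypothetical.

```python
import numpy as np

def skew(t):
    """Matrix of the cross product with t."""
    return np.array([[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]])

def random_rotation(rng, max_angle=0.3):
    """Rodrigues formula for a random axis and a bounded random angle."""
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = rng.uniform(-max_angle, max_angle)
    Kx = skew(axis)
    return np.eye(3) + np.sin(angle) * Kx + (1.0 - np.cos(angle)) * Kx @ Kx

def synthetic_pair(n, noise_std, rng):
    """Return noisy homogeneous image points u, u' and the true F."""
    K = np.diag([1000.0, 1000.0, 1.0])                 # focal length in pixels
    X = rng.uniform(-20.0, 20.0, (n, 3)) + np.array([0.0, 0.0, 80.0])  # cm
    R, t = random_rotation(rng), rng.uniform(-10.0, 10.0, 3)
    x1 = X @ K.T                                       # first camera K [I | 0]
    x2 = (X @ R.T + t) @ K.T                           # second camera K [R | t]
    u, up = x1 / x1[:, 2:3], x2 / x2[:, 2:3]
    # Gaussian noise on the pixel coordinates only.
    u[:, :2] += rng.normal(scale=noise_std, size=(n, 2))
    up[:, :2] += rng.normal(scale=noise_std, size=(n, 2))
    Kinv = np.linalg.inv(K)
    F_true = Kinv.T @ skew(t) @ R @ Kinv
    return u, up, F_true
```

With `noise_std = 0` the returned points satisfy the epipolar constraint (1) exactly up to rounding, which gives a convenient sanity check before noise is added.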

In the first experiment, a comparison of the mean algebraic error $e_a$ defined as:

(47)

for the linear criterion ($F_l$) and the constrained least-squares criterion ($F_{cls}$) has
been performed. The goal is to show that imposing the rank constraint a priori
gives very different results with respect to setting the smallest singular value to
zero. Figure 2 shows the behaviour of the logarithm of $e_a$ for the two methods. In
the second experiment we consider the properties of estimated epipolar geometry.
Specifically, we compare the mean geometric error $e_g$ defined as:

(48)

that is the mean geometric distance between points and corresponding epipolar
lines. Figure 3 shows the behaviour of the logarithm of $e_g$ for linear and
constrained least-squares criterion. As we can see, the error $e_g$ achieved by $F_{cls}$ is
clearly less than the one achieved by $F_l$ for all image noises considered.

Fig. 2. Logarithmic algebraic error ($\log_{10}(e_a)$) for linear (dashed) and constrained
least-squares criterion (solid).

Fig. 3. Logarithmic geometric error ($\log_{10}(e_g)$) for linear (dashed) and constrained
least-squares criterion (solid).

Now, let us see the results obtained with real data. Figure 4 shows two typical views
used to estimate the fundamental matrix. The forty point correspondences are
found by a standard corner finder. Table 1 shows the geometric error $e_g$ given
by linear and constrained least-squares criterion, and by the distance to epipolar
lines criterion initialized by $F_l$ ($\bar{F}_d$) and by $F_{cls}$ ($F_d$). As we can see, the
geometric error achieved by $F_{cls}$ is clearly less than the one achieved by $F_l$.

Table 2 shows the geometric error obtained for the views of figure 5. Here, the
Fig. 4. King’s College sequence (epipoles outside the image). The epipolar lines are
given by $F_d$ after optimization using the $F_{cls}$ solution as the starting point.

Table 1. Geometric error $e_g$ obtained with the image sequence of figure 4.

Criterion      Geometric error $e_g$
$F_l$          1.733
$F_{cls}$      0.6578
$\bar{F}_d$    0.6560
$F_d$          0.6560



point correspondences used are 27. Again, $F_{cls}$ achieves a significant
improvement with respect to $F_l$.



Table 2. Geometric error for the image sequence of figure El 



Criterion      Geometric error $e_g$
$F_l$          1.255
$F_{cls}$      0.6852
$\bar{F}_d$    0.5836
$F_d$          0.5836



Finally, table 3 and table 4 show the geometric error obtained for the well 
known examples used in |2j and shown in figures El and 0 The point correspon- 
dences used are 100 for the first example and 128 for the second one. Observe 
that this time, not only does $F_{cls}$ achieve a smaller geometric error than $F_l$, but
also $F_d$ produces a better result than $\bar{F}_d$, indicating the presence of different
local minima. Moreover, in the calibration jig example, $F_{cls}$ provides better results
even than $\bar{F}_d$.
Fig. 5. Cambridge street sequence (epipoles in the image).
Table 3. Geometric error for the views of figure 



Criterion      Geometric error $e_g$
$F_l$          0.4503
$F_{cls}$      0.4406
$\bar{F}_d$    0.1791
$F_d$          0.1607



As we can see from the results, the solution provided by our method gives 
smaller algebraic and geometric errors with respect to the linear criterion, with 
both synthetic and real data. Moreover, initializing non-linear criteria with our 
solution allows us to achieve more accurate estimates of the fundamental matrix. 

6 Conclusions 

In this paper, we have proposed a new method for the estimation of the funda- 
mental matrix. It consists of minimizing the same algebraic error as that used in 
the linear criterion, but taking into account explicitly the rank constraint. We 
have shown how the resulting constrained least-squares problem can be solved 
using recently developed convexification techniques. Our experiments show that 
this method provides a more accurate estimate of the fundamental matrix com- 
pared to that given by the linear criterion in terms of epipolar geometry. This 
suggests that our estimation procedure can be used to initialize more complex 
non-convex criteria minimizing the geometric distance in order to obtain better 
results. 
Fig. 6. Oxford basement sequence (epipoles in the image).
Table 4. Geometric error for the views of figure Q 



Criterion      Geometric error $e_g$
$F_l$          0.4066
$F_{cls}$      0.1844
$\bar{F}_d$    0.1943
$F_d$          0.1844
Abstract. This paper presents a 3D active contour model for boundary
tracking, motion analysis and position prediction of non-rigid objects, 
which applies stereo vision and velocity control to the class of deforma- 
ble contour models, known as snakes. The proposed contour evolves in 
three dimensional space in reaction to a 3D potential function, which 
is derived by projecting the contour onto the 2D stereo images. The 
potential function is augmented by a velocity term, which is related to 
the three dimensional velocity field along the contour, and is used to 
guide the contour displacement between subsequent images. This leads 
to improved spatio-temporal tracking performance, which is demonstra- 
ted through experimental results with real and synthetic images. Good 
tracking performance is obtained with as little as one iteration per frame, 
which provides a considerable advantage for real time operation. 



1 Introduction 



Deformable contours have been established in the last decade as a major tool for 
low level vision functions, including edge detection, segmentation and tracking of 
non-rigid objects. Boundary detection and tracking based on deformable planar 
contours, known as snakes, were originally introduced by Kass, Witkin and
Terzopoulos. Energy-minimizing active contours are deformable contours
that move under the influence of image-induced potential, subject to certain 
internal deformation constraints. The contour dynamics may be specified by the 
Euler-Lagrange equations of motion associated with the contour potential. Using 
the image gradient as the potential function, for example, results in edge-seeking 
deformation forces, leading the contour towards high contrast boundaries, thus 
performing boundary detection. Many variations and extensions of the original
snake model have been reported in recent literature, see 1 31 1 314191 and references 
therein. 

In this paper we present an active contour model in three dimensional space, 
which is intended for temporal tracking of the contour of three dimensional
objects in stereo image sequences. 
Contour tracking in a sequence of images may be performed by initializing
the snake in each image using its last position in the previous frame. A significant 
improvement on this basic scheme may be obtained by incorporating inter-frame 
velocity estimates to improve the initial positioning. By starting the snake closer 
to the actual object boundary, we obtain improved convergence and reduced 
likelihood of trapping in local minima. Velocity information may be obtained by 
extrapolating the contour motion from previous images, using Kalman filtering 
or related methods. Another approach, more relevant to the
present study, is to extract an estimate of local or global velocity from two
adjacent images. One such scheme uses correlation and rigid motion models to
estimate the global motion of the object between the two frames. Another
proposes an explicit calculation of the local velocity field along
the detected contour, using optical flow methods, which is then used to predict
the next position of the contour. A related idea is to use a potential term
related to the temporal difference between images, in order to focus the tracking 
contour on regions of temporal change, as an indication of object motion against
a stationary background. 

Recently, an integrated spatio-temporal snake model was proposed for 2D
contour tracking in combined spatio-velocity space. The velocity snake
model uses the optical flow constraints in order to propagate the contour in the 
direction of the local velocity field. Explicit computation of the optical flow is 
avoided by incorporating the optical flow constraint equation directly into the 
snake dynamics equations. The improved tracking performance of the model 
was demonstrated theoretically by treating the image sequence as continuous 
measurements along time, where it was shown that the proposed model converges 
with no tracking error to a boundary moving at a constant velocity |1 ti] . This is 
in contrast to the basic snake model which is biased due to lack of image velocity 
input. The present paper generalizes these ideas to 3D contour tracking of shape 
and motion in stereo images. 

Deformable contours are well suited for stereo matching, as they can perform 
simultaneously the two tasks of feature detection and correspondence. This was 
originally realized in ll fll| . which proposes to use different 2D contours in each 
of the stereo images, coupled by a stereo disparity smoothness term in their 
respective potential functionals. A deformable spline model for stereo matching 
that evolves in 3D was proposed in j^, which employs an additive potential 
function based on the projections of the curve on each image. A similar idea will 
be used in the tracking scheme proposed here. In the authors propose an 
affine epipolar geometry scheme for coupling pairs of active contours in stereo 
images to enhance stereo tracking of 3D objects. 

The active contour model proposed here is a parameterized curve which evolves 
in three dimensional space under the influence of a three dimensional potential 
function. We use an additive potential which is derived by projecting the 
3D contour onto the two stereo images, where 2D gradient potentials are 
defined. An alternative multiplicative potential is also considered. The basic 
potential is augmented by a velocity term related to the optical flow in each of the 
images. This term provides an additional force which approximately tracks the 
actual contour motion. The resulting dynamic equation of the tracking contour 
is realized in discrete time and space, with a single iteration step per frame. 

It should be mentioned that the methods used in the paper are easily scalable 
to more than two cameras, as well as general geometric configurations. 

The rest of the paper is organized as follows. In the next section we briefly 
describe the basic velocity snake model in two dimensions. The complete three 
dimensional model is presented in Section 3, followed by a discussion of some 
implementation considerations in Section 4. Experimental results are presented 
and discussed in Section 5, followed by some concluding remarks. 



2 The Velocity Snake 



In this section we present the basics of the two dimensional Velocity Snake proposed 
in ca. Consider the closed contour $v(s,t) = (x(s,t), y(s,t))$ for a parametric 
domain $s \in [0, 1]$ and time $t \in [0, \infty)$. The snake Lagrangian, originally proposed 
by Terzopoulos et al. CHI, is given by: 

$$L = \frac{1}{2}\int_0^1 \mu |v_t|^2 \, ds - \frac{1}{2}\int_0^1 \left( w_1 |v_s|^2 + w_2 |v_{ss}|^2 \right) ds - \frac{1}{2}\int_0^1 P(v) \, ds \qquad (1)$$

The first term is the kinetic energy field, where $\mu(s)$ is the mass of the 
snake. The second term defines the internal deformation energy, where $w_1(s)$ 
controls the tension and $w_2(s)$ controls the rigidity of the snake. 
The third term is the potential field energy of the contour, imposed by the 
image. The potential energy may be derived from a single (fixed) image, in 
which case the snake will converge to a static shape, or it may be derived from 
a temporal sequence of images, for tracking purposes. In the latter case, the 
potential becomes a function of time (cf. CZ]). In this paper, we consider for 
concreteness the common case of an edge oriented potential: 



$$P(x,y,t) = -c \, \left| \nabla \left[ G_\sigma * I(x,y,t) \right] \right| \qquad (2)$$



This will cause the snake to be attracted to edges. The function $I(x, y, t)$ denotes 
the brightness of pixel $(x, y)$ at time $t$ in the image, and $G_\sigma$ denotes a Gaussian 
filtering window with variance $\sigma$. The energy dissipation function which is used 
in conjunction with the Lagrangian (1) to describe the effects of nonconservative 
forces is defined here by: 



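$$D(v_t, v_t^b) = \frac{\gamma}{2}\int_0^1 \left\| L^T (v_t - v_t^b) \right\|^2 ds + \frac{\beta}{2}\int_0^1 \left\| \frac{\partial}{\partial s} v_t \right\|^2 ds \qquad (3)$$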



where $L$ is a real matrix, and $v^b$ and $v_t^b$ denote the boundary position and 
velocity, respectively. The second term represents a smoothness constraint. Using 
the above Lagrangian (1) and the dissipation function (3), the Euler-Lagrange 



254 R. Zaritsky, N. Peterfreund, and N. Shimkin 



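equations of motion of the velocity snake are given by: 

$$\mu v_{tt} + \gamma \, c(v_t, v_t^b) - \beta \frac{\partial}{\partial s}\left( \frac{\partial}{\partial s} v_t \right) - \frac{\partial}{\partial s}(w_1 v_s) + \frac{\partial^2}{\partial s^2}(w_2 v_{ss}) = -\nabla P(v(s,t), t) \qquad (4)$$

where $c(v_t, v_t^b) = L L^T (v_t - v_t^b)$. Since the boundary velocity of the object is 
unknown in advance, we use instead the apparent velocity (optical flow) $v_t^i$ of 
the image at the contour position. 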

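A remarkable special case exists when $L = \nabla I(v(s,t))$ and $v_t^b$ is replaced by 
the apparent velocity. In this case we obtain the integrated optical flow model 

$$\mu v_{tt} + \gamma \nabla I \left( \nabla I^T v_t + I_t \right) - \beta \frac{\partial}{\partial s}\left( \frac{\partial}{\partial s} v_t \right) - \frac{\partial}{\partial s}(w_1 v_s) + \frac{\partial^2}{\partial s^2}(w_2 v_{ss}) = -\nabla P(v(s,t), t) \qquad (5)$$

This result is due to the optical flow constraint equation [9]: 

$$\nabla I(x,y,t)^T v_t^i + I_t(x,y,t) = 0 \qquad (6)$$

Compared to the general model (4), the model presented by (5) does not 
require an explicit estimation of the image velocity. 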
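
The velocity term that distinguishes the integrated model can be illustrated for a single contour point. The following is only a minimal sketch, assuming the term has the form grad I (grad I . v_t + I_t); the function name `velocity_term` and all numeric values are hypothetical and chosen for illustration.

```python
def velocity_term(grad_I, I_t, v_t):
    """Velocity control term grad_I * (grad_I . v_t + I_t) at one 2D contour point.

    grad_I: (Ix, Iy) spatial image gradient at the point (assumed given).
    I_t:    temporal image derivative at the point.
    v_t:    current contour velocity (vx, vy).
    """
    Ix, Iy = grad_I
    # Residual of the optical flow constraint evaluated at the contour velocity.
    residual = Ix * v_t[0] + Iy * v_t[1] + I_t
    return (Ix * residual, Iy * residual)


# Illustrative values: a pattern translating with velocity (2, -1), so that
# I_t = -(grad_I . velocity) by the optical flow constraint.
grad = (0.5, 0.25)
true_velocity = (2.0, -1.0)
I_t = -(grad[0] * true_velocity[0] + grad[1] * true_velocity[1])

c_matched = velocity_term(grad, I_t, true_velocity)   # contour moves with the image
c_static = velocity_term(grad, I_t, (0.0, 0.0))       # contour does not move
```

When the contour velocity satisfies the optical flow constraint the term vanishes; otherwise it produces a force along the image gradient, which is why no explicit flow estimate is needed in this sketch.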



3 The 3D Velocity Snake Model 

Given stereo image sequences, the present paper extends the two dimensional 
velocity snake to object tracking in three dimensions. The required 3D potential 
function is extracted at each point in time from the two stereo images. The 
velocity term serves to improve the quality of the 3D tracking. In the following, 
we briefly present the relevant 3D to 2D relations. The 3D space- velocity tracking 
model is then derived. 



3.1 3D Projections 

We assume here the following layout of the stereo cameras. The two cameras 
are parallel and positioned at the same height. The distance between the two 
cameras is $b$, and the distance between the focus and the image plane is $f$. The focal point 
of each camera is positioned at the center of each image. The world coordinate 
axes is in the plane which includes the focal points of the two images and is at 
a distance of b/2 from each focal point. 

Consider a point $P = (X, Y, Z)$ in world coordinates. In the left and right 
images the point will be projected onto $p_l = (x_l, y)$ and $p_r = (x_r, y)$ respectively. 
We have 

$$\frac{x_l}{f} = \frac{X + b/2}{Z}, \qquad \frac{x_r}{f} = \frac{X - b/2}{Z}, \qquad \frac{y_l}{f} = \frac{y_r}{f} = \frac{Y}{Z} \qquad (7)$$



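For the velocity snake we will also require the equations linking the three dimensional 
velocity and the two dimensional image velocity. Differentiating equation 
(7) in time we obtain: 

$$(p_l)_t = H_l P_t, \qquad (p_r)_t = H_r P_t \qquad (8)$$

where: 

$$H_l = \frac{f}{Z}\begin{pmatrix} 1 & 0 & -\frac{X + b/2}{Z} \\ 0 & 1 & -\frac{Y}{Z} \end{pmatrix}, \qquad H_r = \frac{f}{Z}\begin{pmatrix} 1 & 0 & -\frac{X - b/2}{Z} \\ 0 & 1 & -\frac{Y}{Z} \end{pmatrix}$$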
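
The projection relations and their time derivative can be sketched numerically. The code below is a hedged illustration for the parallel-camera layout described above; the function names (`project`, `projection_matrix`, `image_velocity`) and the numeric values are assumptions made here, and the 2x3 matrix is simply the Jacobian obtained by differentiating the projection equations.

```python
def project(P, f, b):
    """Project a world point P = (X, Y, Z) onto the left and right images."""
    X, Y, Z = P
    left = (f * (X + b / 2) / Z, f * Y / Z)
    right = (f * (X - b / 2) / Z, f * Y / Z)
    return left, right


def projection_matrix(P, f, b, side):
    """2x3 Jacobian of the projection, linking 3D velocity to image velocity."""
    X, Y, Z = P
    offset = b / 2 if side == "left" else -b / 2
    return [[f / Z, 0.0, -f * (X + offset) / Z ** 2],
            [0.0, f / Z, -f * Y / Z ** 2]]


def image_velocity(H, V_t):
    """Image velocity H V_t for a 3D velocity V_t = (X_t, Y_t, Z_t)."""
    return tuple(sum(h * v for h, v in zip(row, V_t)) for row in H)


# Illustrative values only: a point 500 units away moving towards the cameras.
P_demo = (10.0, 20.0, 500.0)
left, right = project(P_demo, f=500.0, b=58.0)
H_left = projection_matrix(P_demo, f=500.0, b=58.0, side="left")
v_left = image_velocity(H_left, (0.0, 0.0, -25.0))
```

In this example the horizontal disparity `left[0] - right[0]` equals f*b/Z, and a pure motion in depth produces an image velocity pointing away from the image center.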



3.2 The 3D Tracking Model 

We define the snake as a 3D parametric contour $V(s, t)$ that depicts the points 
of the contour in world coordinates at time $t$, $V(s,t) = [X(s,t), Y(s,t), Z(s,t)]$, 
where $s \in [0, 1]$ and $t \in [0, \infty)$. We will define the Lagrangian energy of the 3D 
snake similar to the 2D case as: 

$$L = \frac{1}{2}\int_0^1 \mu |V_t|^2 \, ds - \frac{1}{2}\int_0^1 \left( w_1 |V_s|^2 + w_2 |V_{ss}|^2 \right) ds - \frac{1}{2}\int_0^1 P(V) \, ds \qquad (9)$$

The first term of equation (9) is the 3D kinetic energy field (where $\mu(s)$ is the 
mass of the snake). The second term defines the 3D internal deformation energy 
where $w_1(s)$ controls the tension and $w_2(s)$ controls the rigidity of the snake. 
We note that in this model, we do not include the torsion term (which exists in 
three dimensions, but not in two dimensions) in order to simplify the resulting 
equations. Our experimental results have indicated that such a term does not 
contribute to the tracking performance. 

The third term in equation (9) is the potential energy field. The 3D potential 
energy is obtained by an additive combination of the stereo potential fields 
evaluated at the projection of the 3D position onto the image planes: 

$$P = \alpha \, P_l\!\left( f\frac{X(s) + b/2}{Z(s)}, \; f\frac{Y(s)}{Z(s)} \right) + \beta \, P_r\!\left( f\frac{X(s) - b/2}{Z(s)}, \; f\frac{Y(s)}{Z(s)} \right) \qquad (10)$$

Here $P_l$ and $P_r$ are the left and right 2D image potentials, respectively. 

Fig. 1. Synthetic stereo image pair of a 3D cube. 

The desirable property of this potential function is that it has local minima 
at the combined local minima of the two component potentials, which in turn 
correspond to the local minima of the boundary edges in 3D space. While this 
may be the simplest combination with this property, other reasonable combinations 
exist. For example, an alternative potential to the one in (10) is the 
multiplicative potential 

$$P = -\sqrt{P_l \, P_r} \qquad (11)$$

(note that all potentials here are assumed negative). Let us briefly discuss and 
compare the properties of these two potentials. While both potential functions 
obtain local minima at the boundary edges of the 3-D object, their behavior in 
the vicinity of these points may differ, depending on image content and clutter. 

Figures 2 and 3 provide an illustration of these two potential functions. We use a 
synthetic stereo image pair of a 3-D cube given in Figure 1. Figures 2 and 3 show 
the computed values of the potential functions (10) and (11) at Y=0, as functions 
of the 3D coordinates X and Z. Note that the absolute value of the potential is 
depicted, so that minima of the potential correspond to maxima in these figures; 
we shall refer in the following discussion to the magnitude of the potential. 
Both functions are seen to have four major local maxima, corresponding to the 
true locations of the four cube edges intersecting Y = 0. Consider the additive 
potential (10) first. According to Figure 2, each of these maxima is formed by 
the intersection of two "ridges", each contributed by the potential energy 
term of a single image. The "ridge" is formed along the epipolar line intersecting 
the image at the object's edge, and due to the additive nature of this potential 
it leads to a considerable magnitude of the overall potential (and possible local 
maxima) even off the intersection with the other ridge. These spurious maxima 
disappear in the case of the multiplicative potential function (11), as illustrated 
in Figure 3, where the potential will be small whenever one of its components is small 

Fig. 2. The additive potential function (10) of the stereo pair in Figure 1 along the 
coordinate Y=0, as a function of the 3-D coordinates X and Z. The magnitude (absolute 
value) of the potential is depicted. 



enough. While this quenching of the potential off the intersection points seems 
to lead to a smoother function and possibly decrease the chances of entrapment 
in local extrema, the additive potential may possess some advantage when the 
deformable contour is far from the desired position, as the information provided 
from one image can guide it to the correct position. Moreover, in the presence 
of partial occlusion in one image of the stereo pair, the additive potential (10) 
provides partial support for tracking based on the contribution of a single image 
to the potential field, while the multiplicative potential field (11) is nulled. In 
the experiments performed within this work, where fairly close tracking was 
maintained, the two potentials yielded similar performance. 
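
The different behavior of the two combinations under partial occlusion can be seen in a small numeric sketch. It assumes, for illustration only, that the multiplicative potential is the negated geometric mean of the two (negative) image potentials; the function names and the potential values are hypothetical.

```python
import math


def additive_potential(P_l, P_r, alpha=0.5, beta=0.5):
    """Weighted sum of the left and right image potentials (both <= 0)."""
    return alpha * P_l + beta * P_r


def multiplicative_potential(P_l, P_r):
    """Negated geometric mean of the two potentials (assumed form, both <= 0)."""
    return -math.sqrt(P_l * P_r)


# Illustrative potentials: a strong edge response is -4, no response is 0.
edge_in_both = (-4.0, -4.0)        # the 3D point projects onto an edge in both images
occluded_right = (-4.0, 0.0)       # the edge is visible in the left image only

add_both = additive_potential(*edge_in_both)
mul_both = multiplicative_potential(*edge_in_both)
add_occluded = additive_potential(*occluded_right)
mul_occluded = multiplicative_potential(*occluded_right)
```

With an edge in both images the two combinations agree; with the edge missing in one image the additive form keeps half of the response while the multiplicative form drops to zero.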

In analogy to the 2D case we propose the 3D dissipation function given by: 

$$D(V_t, V_t^b) = \frac{\gamma}{2}\int_0^1 \left\| L_l^T (V_t - V_t^b) \right\|^2 ds + \frac{\gamma}{2}\int_0^1 \left\| L_r^T (V_t - V_t^b) \right\|^2 ds + \frac{\beta}{2}\int_0^1 \left\| \frac{\partial}{\partial s} V_t \right\|^2 ds \qquad (12)$$

where $\gamma$ and $\beta$ are positive scalars, $V^b$ and $V_t^b$ denote the 3D boundary position 
and velocity, respectively, and $L_l$ and $L_r$ are real matrices which will later be 
defined as functions of the left and right image data. The third term represents a 
smoothness constraint. The first two terms provide the velocity tracking term in 
3D space. In the following we show how these terms may be computed without 
explicit computation of $V_t^b$. 

Fig. 3. The multiplicative potential function (11) of the stereo pair in Figure 1 along 
the coordinate Y=0, as a function of the 3-D coordinates X and Z. 



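Using the Lagrangian (9) and the dissipation function (12), the Euler-Lagrange 
equations of motion are given by: 

$$\mu V_{tt} + \gamma \left( c_l(V_t, V_t^b) + c_r(V_t, V_t^b) \right) - \beta \frac{\partial}{\partial s}\left( \frac{\partial}{\partial s} V_t \right) - \frac{\partial}{\partial s}(w_1 V_s) + \frac{\partial^2}{\partial s^2}(w_2 V_{ss}) = -\nabla P(V(s,t), t) \qquad (13)$$

where: 

$$c_l(V_t, V_t^b) = L_l L_l^T (V_t - V_t^b), \qquad c_r(V_t, V_t^b) = L_r L_r^T (V_t - V_t^b)$$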

An explicit computation of the 3D object velocity $V_t^b$ is obviously a demanding 
task. This may be circumvented by defining the following image dependent 
weight matrices $L_l^T = \nabla I_l^T H_l$ and $L_r^T = \nabla I_r^T H_r$, where $H_l$ and $H_r$ are the 
projection matrices in (8) and $I_l$ and $I_r$ denote the left and right images. This 
gives: 

$$c_l(V_t, V_t^b) = H_l^T \nabla I_l \left( \nabla I_l^T H_l V_t - \nabla I_l^T H_l V_t^b \right)$$
$$c_r(V_t, V_t^b) = H_r^T \nabla I_r \left( \nabla I_r^T H_r V_t - \nabla I_r^T H_r V_t^b \right) \qquad (14)$$

The term $\nabla I^T H V_t^b$ is equal to $\nabla I^T v_t^b$, where $v_t^b$ is the apparent motion of 
the image boundary. Approximating $v_t^b$ by the apparent motion of the image $v_t^i$ 
at the snake position in each image, and then using the optical flow constraint 
equation (6), this term reduces to $-I_t$. $c_l$ and $c_r$ then become: 

$$c_l(V_t, V_t^b) = H_l^T \nabla I_l \left( \nabla I_l^T H_l V_t + (I_l)_t \right)$$
$$c_r(V_t, V_t^b) = H_r^T \nabla I_r \left( \nabla I_r^T H_r V_t + (I_r)_t \right) \qquad (15)$$

The resulting Euler-Lagrange equation given by (13) and (15) is easily 
implemented by direct discretization. 
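
For a single contour point and a single camera, the velocity term can be sketched as follows. This is an illustrative reading of the term H^T grad I (grad I^T H V_t + I_t), not the authors' implementation; the function name `stereo_velocity_term` and all numbers are assumptions.

```python
def stereo_velocity_term(H, grad_I, I_t, V_t):
    """One-camera velocity term H^T grad_I (grad_I^T H V_t + I_t).

    H:      2x3 projection (Jacobian) matrix of the camera at the contour point.
    grad_I: (Ix, Iy) image gradient at the projected point.
    I_t:    temporal image derivative at the projected point.
    V_t:    3D contour velocity (X_t, Y_t, Z_t).
    """
    # g = H^T grad_I is the image gradient pulled back to 3D space.
    g = [H[0][j] * grad_I[0] + H[1][j] * grad_I[1] for j in range(3)]
    # Optical flow residual of the projected 3D velocity.
    residual = sum(gj * vj for gj, vj in zip(g, V_t)) + I_t
    return [gj * residual for gj in g]


# Illustrative values only.
H_demo = [[1.0, 0.0, -0.5],
          [0.0, 1.0, 0.0]]
grad_demo = (2.0, 0.0)
V_demo = (0.0, 0.0, -4.0)          # motion in depth; projects to image velocity (2, 0)

c_consistent = stereo_velocity_term(H_demo, grad_demo, -4.0, V_demo)   # I_t matches the motion
c_mismatch = stereo_velocity_term(H_demo, grad_demo, 0.0, V_demo)      # static image, moving contour
```

The full term in the equation of motion would be the sum of the left and right contributions, each evaluated with its own projection matrix and image derivatives.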



3.3 Space-Time Discretization 

We briefly describe in this subsection the discretization in space and time of the 
equations of motion of the 3D velocity snake. The continuous model is transformed 
into simple discrete-time equations with nonlinear image-based inputs 
through conventional finite difference techniques. Consider first the space 
discretization of the equations of motion as proposed above. Let $U = [U^1, \ldots, U^N]$ 
be the equidistant sampling vector of the contour $V(s)$ with a sampling distance 
of $h = 1/N$. Using a simple finite difference approximation for the space derivatives, 
and substituting (15) into (13), we obtain the discrete version of the 
equations of motion: 

$$M U_{tt} + \gamma \left( H_l^T \nabla I_l \left( \nabla I_l^T H_l U_t + (I_l)_t \right) + H_r^T \nabla I_r \left( \nabla I_r^T H_r U_t + (I_r)_t \right) \right) + \beta D D^T U_t + K U = -\nabla P(U) \qquad (16)$$

where $K$ is the deformation matrix defined in PHI, $D$ is the "derivative" matrix 
defined in HZ] and $M$ is the mass matrix. For brevity we omit the details here. 
For comparison, a 3D tracking contour which employs the approach suggested 
by the original snake model (as outlined e.g. in PHI) does not have the velocity 
control and smoothing terms, and is given by 

$$M U_{tt} + \gamma U_t + K U = -\nabla P(U) \qquad (17)$$

The discretization in time of (16) is again performed by straightforward central 
difference approximation. The sampling interval $T$ is taken as the video inter-frame 
interval. 
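
Since the details of the time discretization are omitted, the following is only a scalar sketch of one central-difference step for an equation of the form m u_tt + c u_t + k u = force, with the stiffness term evaluated at the current sample. The function name and the choice of an explicit stiffness term are assumptions; the actual scheme may differ.

```python
def central_difference_step(u_curr, u_prev, force, m, c, k, T):
    """Advance m*u_tt + c*u_t + k*u = force by one frame interval T.

    Uses u_tt ~ (u_next - 2*u_curr + u_prev) / T**2 and
         u_t  ~ (u_next - u_prev) / (2*T),
    and solves the resulting linear equation for u_next.
    """
    lhs = m / T ** 2 + c / (2 * T)
    rhs = (force - k * u_curr
           + (2 * m / T ** 2) * u_curr
           - (m / T ** 2 - c / (2 * T)) * u_prev)
    return rhs / lhs


# Illustrative checks with a 25 frames/sec interval.
T_frame = 1 / 25
# A sample at rest in equilibrium (force balances stiffness) stays where it is.
u_rest = central_difference_step(3.0, 3.0, 0.1 * 3.0, m=1.0, c=100.0, k=0.1, T=T_frame)
# With no damping, stiffness or force, the sample continues at constant velocity.
u_free = central_difference_step(1.0, 0.5, 0.0, m=1.0, c=0.0, k=0.0, T=T_frame)
```

In the vector case the scalars m, c and k become the matrices of the discrete model, and the division becomes a linear solve performed once per frame.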

4 Experimental Results 

We demonstrate the performance of the proposed three dimensional velocity 
contour model by applying it to simulated and to real image sequences. We 
first show the tracking capability of the three dimensional velocity snake (16), 
applied to a synthetic cube sequence, and compare the results to those of the three 
dimensional active contour (17), which does not employ the velocity control term. 
As was noted, the latter is a direct generalization of the Kalman Snake of Szeliski 
and Terzopoulos PH] to the three dimensional space. The tracking capability is 
then demonstrated on two real stereo image sequences, one with rigid motion (a book) 
and one with nonrigid motion (a hand). 

Fig. 4. Tracking results of the synthetic cube in Figure 1 with the regular snake (17) 
and the velocity snake (16), at the initial frames (top), and the 50th and 97th frames. 

Prior to the calculation of the image gradients and derivatives, the image 
sequences were smoothed in space by a Gaussian filter with $\sigma = 2$ for the 
potential computation, and with $\sigma = 5$ for the velocity measurements. The 
sequence for the velocity measurements was also smoothed along time. Note 
that the amount of smoothing was based on prior knowledge of the bounds on 
image motion. The spatial derivatives of the images were calculated by applying 
a simple 3x3 Prewitt operator, and the time derivative by simple subtraction 
of values of successive images, divided by the frame period. 
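
The derivative estimates described above can be sketched on plain nested lists. The normalization by 6 (so that a unit ramp yields a unit slope) and the function names are assumptions made for this illustration; the smoothing steps are not shown.

```python
def prewitt_gradient(img, r, c):
    """3x3 Prewitt estimate of (dI/dx, dI/dy) at interior pixel (r, c).

    x runs along columns and y along rows. The division by 6 is an assumed
    normalization that makes a linear ramp return its true slope.
    """
    gx = sum(img[r + dr][c + 1] - img[r + dr][c - 1] for dr in (-1, 0, 1))
    gy = sum(img[r + 1][c + dc] - img[r - 1][c + dc] for dc in (-1, 0, 1))
    return gx / 6.0, gy / 6.0


def time_derivative(curr, prev, r, c, T):
    """Difference of successive frames at pixel (r, c), divided by the frame period T."""
    return (curr[r][c] - prev[r][c]) / T


# Illustrative frames: a linear ramp, and the same ramp one gray level brighter.
ramp = [[2 * c + 3 * r for c in range(3)] for r in range(3)]
brighter = [[value + 1 for value in row] for row in ramp]

grad = prewitt_gradient(ramp, 1, 1)
I_t = time_derivative(brighter, ramp, 1, 1, T=1 / 25)
```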

The contour models (16) and (17) were chosen with the additive potential 
function (10) with $P = -\|\nabla I\|$ and with $\alpha = \beta = 0.5$. The results with the 
multiplicative potential model (11) were similar. The parameters of the contour 
models were chosen empirically based on multiple runs and on the resulting 
performance. This "pseudo learning" could be done automatically via parameter 
learning (e.g. the auto-regressive modeling and learning proposed in 0). 

Numerical Considerations: For the regular three dimensional snake (17), the 
contour position at each sampled image was computed based on 100 iterations. 
This number was chosen in order to allow the snake to stabilize on a solution 
without too much overhead in time. This heavy computation is in contrast to 
the reduced complexity provided by the velocity snake (16), which employs only 
one iteration per image sample. 

In order to fully understand the tracking performance of the proposed 3D 
active contour, the tracking was performed on a synthetic stereo image sequence 
of a synthetic cube. We simulated a three dimensional cube 40 x 40 x 40 moving 
in the positive Z direction (towards the camera) with a velocity of 25 units/sec. 
The cube was initially positioned at a distance of 500 units in the Z direction. 
The two dimensional stereo-image sequence was formed by using perspective 
projection with a camera focus of f=500 and with a grayscale of 64 colors. Each 
cube face was painted with a different color (grayscale level of 15, 40 and 63 
respectively) with the background painted in black (grayscale level of 1) (see 
Figure 4). The image sequence was made up of 97 images, with a sampling 
interval along time taken as the inverse of the frame rate, T = 1/25 sec. 

Contour Parameters: We used contour models with a three dimensional spatial 
sampling distance of 4 units. The model (17) was used with $\mu = 1$, $w_1 = 5$, 
$w_2 = 0.1$ and $\gamma = 100$, and the three dimensional velocity snake with $\mu = 1$, 
$w_1 = 5$, $w_2 = 0.01$, $\gamma = 0.01$ and $\beta = 1000$. In addition, the potential was 
multiplied by a factor of 100 in order to allow the image to form the proper 
attraction force. 

Tracking Results: The tracking results of the moving cube are shown in Figure 
4. We show samples of the image sequence at the 50th and the 97th frames. The 
corresponding tracking results of the regular snake (17) and of the velocity snake 
(16) are shown below. It can be seen that both models successfully track the cube 
throughout the sequence. The tracking of the regular three dimensional snake 
shows two major flaws. The first is bunching of the sampling points towards the center of 
the cube's sides. This happens because there is no correspondence information 
between points in successive images. The second is that the contour lags behind 
the synthetic cube. This is caused by the fact that the snake samples have no 
prior information on the velocity of the object being tracked; instead they have 
to settle onto each new image. Both of the above flaws are solved by adding the 
velocity term into the three dimensional snake. This term gives an estimate of 
the object's velocity vector between subsequent frames, which stops the bunching 
of the snake samples and the lagging behind the tracked contour. Note that using 
only one iteration per frame for the model (17) resulted in loss of tracking 
after a few frames. 

Fig. 5. Samples of a book tracking video sequence (1st frame and 101st frame) with 
the position of the three dimensional velocity snake projected onto them. 



Next we present the results of tracking a book with the regular model (17) 
and with the velocity snake (16). The book had the dimensions of 24 x 21 x 6 
cm and was moving towards the camera (-Z direction) from a distance of 3.1 
meters to 2.3 meters (2.7 meters only for the regular snake). The image sequence 
comprised only 51 images for the regular three dimensional snake (we 
stopped after 51 frames because of poor tracking) and 101 images for the three 
dimensional velocity snake. The stereo images were captured using a stereo setup 
- two identical synchronized cameras positioned 58 cm apart in parallel. The 






Tracking of 3D Deformable Contours 263 







Regular Snake Velocity Snake 



Fig. 6. The 3D snake contours corresponding to the images in Figure 5. 



camera focus was f=476. The images had a grayscale of 256 colors and were 
digitized on two Silicon Graphics workstations at a rate of 25 frames per second. 

Contour Parameters: We used the contour models with a three dimensional 
sampling distance of 2.5 cm. The model (17) was used with $\mu = 1$, $w_1 = 5$, 
$w_2 = 0.01$ and $\gamma = 10$. For the three dimensional velocity snake, we used $\mu = 1$, 
$w_1 = 1$, $w_2 = 0.01$, $\gamma = 0.001$ and $\beta = 2000$. For the regular three dimensional 
snake, the potential was multiplied by a factor of 10, while a factor of 5 was used 
for the three dimensional velocity snake. 

Tracking Results: The results of tracking of the book in 3D are shown in 
Figures 5 and 6. We show the results only up to the 51st frame, as the results 
of the model (17) were very poor following that frame. The velocity snake gave 
precise results along the 100 frames of the sequence. 

In Figure 7 we show the results of tracking a moving hand with the velocity 
snake (16). The model (17) could not handle this sequence. In this sequence we 
wished to demonstrate the robustness of the three dimensional velocity snake in 
a long image sequence (about 14 seconds) which included changes in velocity of 
the object being tracked (forward and backward motion) as well as a change in 
the object's shape (the hand turns sideways at the end of the sequence). The 
sequence comprised 345 images of a hand moving towards and away from 
the cameras from a distance of 3.4 meters to 2.4 meters. This sequence shows the 
robustness of the three dimensional velocity snake. The hand moved backwards 
and forwards as well as to the sides, with the three dimensional velocity snake 
keeping track throughout the sequence. There was a slight loss of track when 
the motion changed directions, but the three dimensional velocity snake quickly 
caught up to the object and continued tracking (the loss of track was caused 
by the fact that there is no acceleration prediction, only velocity prediction). 
This sequence also showed the three dimensional velocity snake's capability of 
tracking changes in shape of the three dimensional object. 

Fig. 7. Tracking results of a moving hand using the 3D velocity snake at the 1st, 100th 
and 225th frames. 



5 Concluding Remarks 

We have considered in this paper the problem of 3D tracking using stereo 
sequence data. The tracking scheme incorporated 3D active contours, and optical 
flow velocity information was introduced to improve the tracking performance. 
A particular choice of the relevant parameters, which greatly reduces the 
computational requirements, was presented. The experimental results which use this 
scheme indicate successful tracking of simulated and real scenes and clearly 
demonstrate the performance improvement associated with the velocity term. 

The results of this paper clearly demonstrate the feasibility of the proposed 
approach for real time 3D contour tracking. Additional work is required 
in order to optimize the computational load and improve numerical stability. 
The incorporation of similar ideas in the context of geometric snake models and 
related level set approaches is an interesting area for future research. 



Abstract. Given a 3D object and some measurements for points in this object, it 
is desired to find the 3D location of the object. A new model based pose estimator 
from stereo pairs based on linear programming (LP) is presented. In the presence 
of outliers, the new LP estimator provides better results than maximum likelihood 
estimators such as weighted least squares, and is usually almost as good as robust 
estimators such as least-median-of-squares (LMEDS). In the presence of noise the 
new LP estimator provides better results than robust estimators such as LMEDS, 
and is slightly inferior to maximum likelihood estimators such as weighted least 
squares. In the presence of noise and outliers - especially for wide angle stereo - 
the new estimator provides the best results. 

The LP estimator is based on correspondence of points to convex polyhedra. 
Each point corresponds to a unique polyhedron, which represents its uncertainty 
in 3D as computed from the stereo pair. A polyhedron can also be computed for a 2D 
data point by using a-priori depth boundaries. 

The LP estimator is a single phase (no separate outlier rejection phase) estimator 
solved by a single iteration (no re-weighting), and always converges to the global 
minimum of its error function. The estimator can be extended to include random 
sampling and re-weighting within the standard framework of a linear program. 



1 Introduction 

Model based pose estimation is a well studied problem in computer vision and in 
photogrammetry, where it is called absolute orientation. The objective of the problem is finding 
the exact location of a known object in 3D space from image measurements. 

Numerous references regarding pose estimation appear in the literature, see itiTll 

EOl. 

The pose estimation problem consists of several sub-problems, including: 

1. Feature type selection - points and lines are commonly used m3- 

2. Measurement type selection - 3D features to 3D features, 2D features to 3D features 
or a combination of 2D features and 3D features 

3. Estimator selection - least squares, Kalman filter, Hough transform miiiiiHu 

Israel Science Foundation Grant 612/97. Contact: {moshe, peleg, werman}@cs.huji.ac.il 



This paper is focused on the estimator part for measurements obtained by a stereo 
head. The proposed estimator is based on point to polyhedron correspondence using 
linear programming. The estimator was tested for non-optimal conditions including 
strong non-Gaussian noise, outliers, and wide field of view (up to ±80°). The proposed 
linear-programming (lp) estimator is compared to the following estimators: 



1. Least squares - L2 based estimator (LSQ). 

2. Weighted least squares (WLS) using covariance matrices. 

3. Least absolute differences - L1 based estimator (LAD). 

4. Least median of squares (LMEDS or LMS). 

5. Tukey bi-weight M-estimator, which uses the LMEDS solution as its initial guess. 

6. A variant of the Tukey M-estimator that uses covariance matrices as well. 

Next we briefly describe each of the above estimators. Section 2 describes the 
proposed LP estimator. Section 3 describes the testing procedure and test results. Section 4 
describes future enhancements of the proposed estimator. The following notation is used: 
given two sets of points in $\mathbb{R}^3$, $\{M_i\}$ - the model points and $\{P_i\}$ - the measurements, 
we want to find a rigid 3D transformation: a rotation matrix $R$ and a translation vector 
$T$ that minimizes the distance between $(T + RP)$ and $M$. 



1.1 Least Square Pose-Estimator 

The error function is $\sum_i \| T + R P_i - M_i \|^2$. To find the transformation we define $S_i = 
P_i - \bar{P}$, $Q_i = M_i - \bar{M}$, where $\bar{P}$, $\bar{M}$ are the averages of $\{P_i\}$ and $\{M_i\}$ respectively. 
If $U \Sigma V^T = SVD(S Q^T)$ then the rotation matrix $R$ is given by $R = U V^T$ and the 
translation vector $T$ is given by $T = \bar{M} - R \bar{P}$. See t6J- Since we would like to use a 
common framework for our comparison for non least-squares estimators as well, we use 
the following three stage algorithm and present its result for LSQ as well. 

1. We seek an affine transformation $F$ (3 x 4) which minimizes some error function. 
For LSQ the error function is $\sum_i \| M_i - F P_i \|^2$. For LSQ, $F$ is recovered by 
$F = (A^T A)^{\#} A^T b$ where $A = [M_1, \ldots, M_n]^T$, $b = [P_1, \ldots, P_n]^T$ and '#' denotes the 
pseudo inverse. 

2. The rotation part of $F$, denoted as $F_R$, is projected onto the closest (in the least-squares 
sense) rotation matrix $R = U V^T$ where $U S V^T = SVD(F_R)$. The determinant of 
$R$ is checked to eliminate reflections ($\det(R) = -1$). This step is common to all 
estimators. 

3. The translation is then recovered. For LSQ by $T = \bar{M} - R \bar{P}$. For other estimators 
by warping the model according to $R$ and then solving for translation. 
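
The three stages can be sketched with NumPy as follows. This is a minimal illustration, not the authors' code: it assumes the convention M_i ~ R P_i + T, stacks the measurements in homogeneous form for the affine fit, and resolves a reflection by flipping the last singular direction, all of which are choices made here for the example.

```python
import numpy as np


def lsq_pose(P, M):
    """Three-stage LSQ pose: affine fit, projection onto a rotation, translation.

    P, M: (N, 3) arrays of measured and model points, assumed to satisfy M ~ R P + T.
    """
    P = np.asarray(P, dtype=float)
    M = np.asarray(M, dtype=float)

    # Stage 1: affine F (3 x 4) minimising sum ||M_i - F [P_i; 1]||^2.
    A = np.hstack([P, np.ones((len(P), 1))])      # N x 4 homogeneous measurements
    X, *_ = np.linalg.lstsq(A, M, rcond=None)     # 4 x 3 solution of A X ~ M
    F = X.T

    # Stage 2: closest rotation (least-squares sense) to the 3 x 3 part of F.
    U, _, Vt = np.linalg.svd(F[:, :3])
    if np.linalg.det(U @ Vt) < 0:                 # reflection: flip the weakest axis
        U[:, -1] *= -1
    R = U @ Vt

    # Stage 3: translation from the centroids, T = mean(M) - R mean(P).
    T = M.mean(axis=0) - R @ P.mean(axis=0)
    return R, T


# Noise-free check with a known pose (all values are made up for illustration).
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
T_true = np.array([1.0, 2.0, 3.0])
P_demo = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 2]], dtype=float)
M_demo = P_demo @ R_true.T + T_true
R_est, T_est = lsq_pose(P_demo, M_demo)
```

Other estimators would replace only the first stage (the affine fit) while keeping the projection onto a rotation unchanged.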

Fig. 1. Stereo uncertainty polyhedron - The shape of each polyhedron is a function of the point-rig 
geometry and the radius of error in the image plane. 



1.2 Weighted Least Squares Estimator 

In WLS we seek an affine transformation $F$ that minimizes $(b - AF)^T W (b - AF)$. 
$F$ is recovered by $F = (A^T W A)^{-1} A^T W b$, where $W$, the weight matrix, is a block 
diagonal matrix, each block being a 3 x 3 covariance matrix $W_i$ of the corresponding pair 
$(M_i, P_i)$. $W_i$ is computed as follows: for $p_i, q_i$ - the projections of the (unknown) 3D 
point $P_i$ onto the two stereo images - we use the bounding rectangles $[(p_i)_x \pm r, (p_i)_y \pm r]$, 
$[(q_i)_x \pm r, (q_i)_y \pm r]$ to compute a 3D bounding polyhedron $D_i$ for $P_i$. The weight matrix 
$W_i$ is taken as the covariance matrix of the eight vertices of $D_i$ with respect to the X, Y, and 
Z coordinates. See Fig. 1. The radius $r$ is a tunable parameter which will be discussed 
later in the article. The shape of the polyhedron is dependent upon point location with 
respect to the stereo head and the image uncertainty radius. This shape varies from point 
to point and it is not symmetric even though the image uncertainty radius $r$ is symmetric. 
For this reason a simple least squares does not provide the best estimation. 

1.3 Least of Absolute Differences Estimator 

In LAD we seek an affine transformation $F$ that minimizes the error function $\sum_i \| M_i - F P_i \|_1$. 
$F$ is recovered by solving the following linear programming problem: 

$$\min \sum_i (r_i^+)_x + (r_i^-)_x + (r_i^+)_y + (r_i^-)_y + (r_i^+)_z + (r_i^-)_z \qquad (1)$$

subject to 

$$(M_i - F P_i)_x + (r_i^+)_x - (r_i^-)_x = 0$$
$$(M_i - F P_i)_y + (r_i^+)_y - (r_i^-)_y = 0$$
$$(M_i - F P_i)_z + (r_i^+)_z - (r_i^-)_z = 0$$
$$r_i^+, \; r_i^- \ge 0$$

In linear programming all variables are non-negative, therefore a real typed variable 
$x$ is represented by a pair of non-negative variables: $x = (x^+ - x^-)$. This holds for the 
elements of $F$ as well (not explicitly shown) and for the residual error slack variables 
$(r_i)_{x,y,z}$. 

At the global minimum point we get either $r^+ = 0$ or $r^- = 0$ for each pair, and hence 
$\sum (r^+ + r^-)$ is the sum of least absolute differences jijj. LAD has been successfully used 
for motion recovery in the presence of noise and outliers; however, since the LAD error 
function is symmetric, and the uncertainty shape in 3D is not, LAD is not fully 
suitable to the pose estimation problem. LAD, like other M-estimators, has a breakdown 
point of zero due to leverage points, therefore it is not considered a robust estimator. 
In the experiments we used strong outliers - up to the maximum possible range within 
image boundaries, trying to break the LAD estimator by leverage points. Although we 
managed to break it down when the number of outliers exceeded a certain point (above 
40%), we did not see leverage point symptoms. This result complies with earlier results 
[*]. We believe that this is due to the fact that the error was bounded by the image size. 



1.4 Least Median of Squares Estimator 

In LMedS we seek an affine transformation F that minimizes $\mathrm{median}_i(\|M_i - F P_i\|^2)$.
LMedS is a robust estimator in the sense of its breakdown point, which is 0.5, the largest
value possible. Unfortunately, deterministic algorithms for LMedS have exponential time
complexity. To solve this problem a probabilistic algorithm based on random sampling is
used: several model estimations are recovered from different random samples. The residual
error is computed for each estimation, and the estimation with the lowest median of
squares is selected as the LMedS estimation. The probability of the algorithm's success is:

$P = 1 - [1 - (1 - q)^k]^n$    (2)

where q is the probability of choosing an outlier, k is the number of guesses needed
for model recovery and n is the number of iterations. However, this analysis assumes
that the data points are dichotomic: each data point is either an outlier or a perfectly
good inlier. This assumption does not hold in the presence of noise, and it affects the
accuracy of the algorithm. Note that the time complexity of the probabilistic algorithm
grows exponentially with k, thus trying to reduce the noise influence by enlarging the
support is very costly. See [*].
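
Equation 2 is easy to evaluate numerically. The following sketch (function names are
illustrative; the values q = 0.5 and k = 4 are assumptions chosen only for the example,
not settings taken from the paper) computes the success probability and the smallest
number of iterations n that reaches a target probability.

```python
import math


def success_probability(q, k, n):
    """Eq. 2: probability that at least one of n samples of size k is outlier-free."""
    return 1.0 - (1.0 - (1.0 - q) ** k) ** n


def iterations_needed(q, k, target):
    """Smallest n for which Eq. 2 reaches the target probability (0 < target < 1)."""
    p_clean = (1.0 - q) ** k          # probability that one sample has no outlier
    if p_clean >= 1.0:                # no outliers at all: one sample suffices
        return 1
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_clean))


n = iterations_needed(q=0.5, k=4, target=0.99)
```

Because the per-sample probability (1 - q)^k shrinks geometrically in k, the required n
grows quickly as the support k is enlarged, which is the cost noted above.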

1.5 Tukey Bi-weight M-Estimator

In the Tukey M-estimator we seek an affine transformation F that minimizes:

$\sum_i \rho(\|M_i - F P_i\| / \sigma_i)$    (3)

$\rho(u) = \begin{cases} \frac{b^2}{6}\left[1 - \left(1 - (u/b)^2\right)^3\right] & |u| \le b \\ \frac{b^2}{6} & |u| > b \end{cases}$

where $\rho(u)$ is the loss function, b is a tuning parameter and $\sigma_i$ is the scale
associated with the value of the residual error $r_{i,F} = \|M_i - F P_i\|$ of the inliers.
Equation 3 is often solved by “iterative re-weighted least-squares” [*] with the following
weight function:

$w(u = r_{i,F}/\sigma) = \psi(u)/u, \quad \psi(u) = \rho'(u) = \begin{cases} u\left[1 - (u/b)^2\right]^2 & |u| \le b \\ 0 & |u| > b \end{cases}$    (4)

where $\sigma$ is a scale estimate. The following scale estimation was used in the test:

$\sigma = \frac{1}{N} \sum_{i=1}^{N} w_i \, r_{i,F}$    (5)

The initial guess of $w_i$ and the scale estimation were obtained using the LMedS
solution. b was set to 4.8. A variant of the Tukey M-estimator that was found useful
for the non-symmetric noise distribution was the combination of the Tukey M-estimator
weight function w with the covariance matrix W, using $W \, \mathrm{diag}(w)$ as the new
weight function. See [*].
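
The bi-weight loss and the re-weighting function can be written down directly. The
sketch below is a minimal illustration (the function names are hypothetical), using
b = 4.8 as in the text and taking u to be the residual already divided by the scale
estimate.

```python
def tukey_rho(u, b=4.8):
    """Tukey bi-weight loss: bounded by b^2 / 6 for |u| > b."""
    if abs(u) <= b:
        return (b * b / 6.0) * (1.0 - (1.0 - (u / b) ** 2) ** 3)
    return b * b / 6.0


def tukey_weight(u, b=4.8):
    """IRLS weight w(u) = psi(u) / u, where psi is the derivative of the loss."""
    if abs(u) <= b:
        return (1.0 - (u / b) ** 2) ** 2
    return 0.0
```

Residuals beyond b receive zero weight, so a gross outlier stops influencing the
re-weighted least squares step once the scale estimate is reasonable.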



2 The Proposed LP Estimator

The uncertainty of each image point in space is a cone in 3D. The vertex of the cone is
located at the camera center, and the intersection of the cone with the image plane is the
image uncertainty circle (or ellipse). The image uncertainty circle can be approximated
by a simple polygon, producing a 3D ray. The intersection of two rays from a stereo pair
is a convex polyhedron in 3D space. Fig. [*] shows the polyhedron obtained by using a
rectangular uncertainty shape (the uncertainty of a pixel, for example). The polyhedron
shape is a function of the point location in the image planes and the image uncertainty
radius. This setup can also be used to express the uncertainty of a point in a single image
(2D data) by bounding the polygon with a global bounding cube (the room walls, for
example) or with near and far planes.

Let $V_j$ be the vertices of some polyhedron; then any point P within the polyhedron
can be expressed as a convex combination:

$P = \sum_j V_j S_j$    (6)

$0 \le S_j \le 1, \quad \sum_j S_j = 1$

Given Eq. 6, the mapping of a model point $M_i$ to the (unknown) 3D point $P_i$ by an
affine transformation F can be expressed using the bounding polyhedron $V_i$ as:

$F M_i = \sum_j (V_i)_j (S_i)_j$    (7)

$0 \le (S_i)_j \le 1, \quad \sum_j (S_i)_j = 1$

Plugging Eq. 7 into the LAD estimator results in the following linear program, the
proposed LP pose estimator:

Min: $\sum_i (r_i^+)_x + (r_i^-)_x + (r_i^+)_y + (r_i^-)_y + (r_i^+)_z + (r_i^-)_z$    (8)

Subject to

$P_i = \sum_j (V_i)_j (S_i)_j$
$(F M_i)_x + (r_i^+)_x - (r_i^-)_x = (P_i)_x$
$(F M_i)_y + (r_i^+)_y - (r_i^-)_y = (P_i)_y$
$(F M_i)_z + (r_i^+)_z - (r_i^-)_z = (P_i)_z$
$r_i^+, r_i^- \ge 0, \quad 0 \le (S_i)_j \le 1, \quad \sum_j (S_i)_j = 1$

The value of the error function of the LP pose estimator is zero iff F maps all model
points somewhere within their corresponding polyhedra. There is no preferred point within the
polyhedron, which is an advantage, especially when coping with systematic bias errors. In the
case that a model point is mapped outside its corresponding polyhedron, the estimator will select
the closest (in the LAD sense) point within the polyhedron as the corresponding point, and the
$L_1$ distance between the selected point and the warped model point will be added to the error value.
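
The per-point contribution to the error can be illustrated in a simplified setting. The
sketch below assumes, purely for illustration, that the uncertainty polyhedron is an
axis-aligned box (the polyhedra in the text are general convex shapes obtained from
intersecting rays); in that special case the closest point in the L1 sense is found by
clamping each coordinate. The function names are hypothetical.

```python
def closest_point_in_box(p, box_min, box_max):
    """Closest point to p (in the L1 sense) inside an axis-aligned box."""
    return tuple(min(max(c, lo), hi) for c, lo, hi in zip(p, box_min, box_max))


def lp_point_error(p, box_min, box_max):
    """L1 distance from a warped model point p to the box; zero if p is inside."""
    q = closest_point_in_box(p, box_min, box_max)
    return sum(abs(a - b) for a, b in zip(p, q))


inside = lp_point_error((0.2, 0.5, 0.9), (0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
outside = lp_point_error((2.0, 0.5, -1.0), (0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
```

Any point inside the box contributes zero, which mirrors the statement that there is no
preferred point within the polyhedron.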
3 Selecting the Radius of Image Uncertainty

The radius of the image uncertainty circle is used by the WLS estimator as well as by the
proposed LP estimator. A good selection of the radius of the uncertainty circle would be one that
matches the uncertainty of the inliers only, due to noise. It should produce small, yet greater than
zero, error values for the LP estimator, since a zero error may indicate too much freedom, which can
cause inaccurate pose estimation. The radius of the image uncertainty circle can be selected as follows:

1. Given a priori knowledge of the noise parameters, the radius is set according to the expected
noise magnitude.

2. If no a priori knowledge is given, but the accuracy of the estimation can be tested (as in
RANSAC), then the radius can be determined by binary search over a predefined region.

3. Otherwise a robust scale estimator such as Eq. 5 can be used.
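
The binary search of option 2 could be organized as in the sketch below. Everything in it
is an assumption made for illustration: `error_at` is a hypothetical callable returning the
LP error value obtained with a given radius, it is assumed to be non-increasing in the
radius (a larger polyhedron can only reduce the error), and `target` is a small positive
error level chosen by the user. The text does not prescribe this particular criterion.

```python
def select_radius(error_at, lo, hi, target, iterations=40):
    """Binary search for the smallest radius whose error does not exceed target.

    error_at : hypothetical callable, radius -> LP error (assumed non-increasing)
    lo, hi   : bounds of the predefined search region, with error_at(hi) <= target
    """
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if error_at(mid) > target:
            lo = mid      # error still too large: the radius must grow
        else:
            hi = mid      # error small enough: try a tighter radius
    return hi


# Toy stand-in for the estimator: the error falls linearly and reaches zero at radius 2.
radius = select_radius(lambda r: max(0.0, 2.0 - r), lo=0.0, hi=4.0, target=0.5)
```

Stopping at a small positive target rather than at zero follows the remark above that a
zero error may indicate too much freedom.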

4 Tests 

All tests were conducted in a simulated environment with known ground truth, known calibration
and known matching for the inliers. The scale was set so that 3D world measurements can be
regarded as millimeters and 2D image measurements as pixels. The simulated image resolution was
500 × 500 pixels, and “automatic zoom” was used to keep the model image spanning the full pixel
range. The simulated stereo rig had a baseline of 150 mm, two identical parallel cameras (due to
the large field of view used) and a focal length of 10 mm. The field of view in the tests was 45°,
and for the large field of view tests 80°. The model consisted of 100 points located on a grid
volume of 4 meter³. Model points were centered at the origin and had small additive noise added
to them to break possible regularities. The pose of the model was randomly selected within a
rotation of ±10° and a translation of ±100 mm. Two basic types of errors were induced:

Noise - Additive, "small" magnitude (up to 4 pixels), uniformly distributed, zero mean noise.
Outliers - Additive, "large" magnitude (up to the full size of the image), uniformly distributed,
positive (non-zero mean) noise.
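
The two error types can be simulated as in the sketch below. It is an illustration only:
the function names are hypothetical, the perturbation is applied independently to each
image coordinate (an assumption; the text does not say whether the bound applies per
coordinate or to the vector magnitude), and no clipping to the image boundaries is done.

```python
import random


def add_noise(point, max_amplitude, rng):
    """Zero-mean uniform noise in [-max_amplitude, +max_amplitude] per coordinate."""
    return tuple(c + rng.uniform(-max_amplitude, max_amplitude) for c in point)


def add_outlier(point, image_size, rng):
    """Positive (non-zero mean) uniform error in [0, image_size] per coordinate."""
    return tuple(c + rng.uniform(0.0, image_size) for c in point)


rng = random.Random(0)                               # fixed seed for repeatability
noisy = add_noise((250.0, 250.0), 4.0, rng)          # small, symmetric perturbation
outlier = add_outlier((250.0, 250.0), 500.0, rng)    # large, one-sided perturbation
```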

The proposed LP estimator (tagged as “LP”) was compared to the following algorithms:

1. Stereo reconstruction without using the model. Tagged as “RAW” and given as a reference.

2. Pose estimation by the least squares estimator. Tagged as “BLS”. The common frame
algorithm for least squares is also presented, tagged as “LSQ”.

3. Pose estimation by the weighted least squares estimator, using covariance matrices. Tagged
as “WLS”. The covariance matrices were calculated using the same data that was used by the
proposed LP estimator.

4. Pose estimation by the LMedS estimator. 500 iterations were used to give practically guaranteed
outlier rejection (but not noise rejection). Tagged as “LMedS”.

5. Pose estimation by the LAD estimator. Tagged as “LAD”.

6. Pose estimation by the Tukey bi-weight M-estimator. Tagged as “Tukey”. The LMedS solution
was used as the initial guess for the Tukey M-estimator.

7. Pose estimation by a variant of the Tukey M-estimator that used covariance matrices as
well, which produced good results in some cases. Tagged as “TWLS”.

The checking criteria (compared to the ground truth before noise and outliers were added) include:

1. Average and maximum absolute error between the ground truth and the warped model points
according to the recovered pose. (The maximum error was selected for its implications for safety
considerations in robotics.)

2. Absolute difference of the translation vector (for X, Y and Z).

3. Absolute difference of the rotation axis (in degrees), and of the rotation angle about the axis
(meaningful only if the difference in the rotation axis is small).
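
The first and third criteria can be computed as in the sketch below. It is illustrative
only: the function names are hypothetical, and the "absolute error" between a ground truth
point and a warped model point is taken to be their Euclidean distance, which is an
assumption.

```python
import math


def model_errors(truth_points, warped_points):
    """Average and maximum distance between ground truth and warped model points."""
    dists = [math.dist(a, b) for a, b in zip(truth_points, warped_points)]
    return sum(dists) / len(dists), max(dists)


def axis_difference_deg(axis_a, axis_b):
    """Angle in degrees between two rotation axes given as 3-vectors."""
    dot = sum(x * y for x, y in zip(axis_a, axis_b))
    norm_a = math.sqrt(sum(x * x for x in axis_a))
    norm_b = math.sqrt(sum(x * x for x in axis_b))
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))  # guard against rounding
    return math.degrees(math.acos(cosine))


avg_err, max_err = model_errors([(0, 0, 0), (1, 0, 0)], [(0, 0, 3), (1, 4, 0)])
```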

Each test was repeated several times, with different random selections each time. In cases
where the results looked alike, the first result appears in the paper. In cases where different
random selections caused significantly different results, the first representative of each case is
shown. The best result(s) in each case (judged by warped model errors), the second best result(s)
and the worst result are marked. The raw stereo data was excluded from ranking as it is given as a
reference only.

4.1 Noise Resistance Test 

In this test, additive uniform, zero mean noise was added to both images. Table 1 shows the
results for maximum noise amplitude between ... 2 pixels. We can see that:

1. Even for the lowest noise level the raw stereo reconstruction has significant error, caused
by the non-optimal setting of the stereo rig.

2. The LSQ estimator did not provide the best result, due to the non-Gaussian distribution of the
noise.

3. The WLS and the Tukey WLS variant estimators provided the best estimates, due to the use of
covariance matrices.

4. The LMedS estimator usually provided the worst results, since all points had additive noise
and due to its small support. (The WLS begins showing good results at about 20 points,
which is already too costly for LMedS.)

5. The LP estimator provided the second best results. It was the only estimator to be at the same
order of error magnitude as the best estimators. It is clearly different from the LAD estimator.



4.2 Outliers Resistance Test 

In this test, strong, additive uniform, positive (non-zero mean) noise was added to the
images. The maximum noise level was 500 pixels. Different numbers of outliers were
tested, from a single outlier to 50% outliers. Tables 2 and 3 show the results for the
outliers test. We can see that:

1. The LMedS estimator and the Tukey M-estimator (with the LMedS result
used as an initial guess) provided the best results.

2. The LAD estimator provided the second best result, followed by the LP estimator.

3. The least squares and weighted least squares estimators were completely broken by the
outliers.

4. The influence of the outliers on the covariance matrix overtook the Tukey M-estimator
weights, causing the Tukey M-estimator variant to fail as well.

5. The LAD and the LP estimators began to break down at 45%.
Noise level: 1/100 pixel

                              Model Error             Translation Error      Rotation Error
            Estimator       Avr       Max         X         Y         Z      Axis     About
            RAW           1.809    18.191
            LSQ           0.113     0.302     0.001     0.031    -0.293     0.004     0.000
            BLS           0.111     0.301     0.001     0.031    -0.293     0.003     0.000
Best >>     WLS           0.011     0.026     0.009     0.008     0.016     0.003     0.000
            LMedS         0.633     1.874    -0.660     2.189     0.397     0.271    -0.025
            Tukey         0.066     0.243     0.079     0.220     0.115     0.025    -0.001
Best >>     TWLS          0.011     0.025     0.008     0.007     0.016     0.003     0.000
2ndBest >>  LP            0.034     0.057     0.027     0.050     0.030     0.007     0.000
            LAD           0.369     0.634    -0.029    -0.166     0.375     0.061     0.001

Noise level: 1/2 pixel

                              Model Error             Translation Error      Rotation Error
            Estimator       Avr       Max         X         Y         Z      Axis     About
            RAW          92.290   963.342
            LSQ          13.408    45.741    10.273   -55.591   -21.364     6.133     0.323
            BLS          13.564    44.147    10.273   -55.591   -21.364     5.828     0.308
Best >>     WLS           0.375     1.267    -0.104     0.623     0.140     0.176     0.004
            LMedS        12.234    25.086   -30.495    -8.716   -10.698     3.217    -0.011
            Tukey        10.372    35.535   -21.900   -10.043   -21.822     3.080     0.009
Best >>     TWLS          0.435     1.499    -0.348     0.690     0.152     0.217     0.004
2ndBest >>  LP            1.425     4.331     0.465     1.952    -0.850     0.093     0.022
            LAD           8.252    24.580   -13.879    -2.843   -18.381     1.793     0.020

Noise level: 1 pixel

                              Model Error             Translation Error      Rotation Error
            Estimator       Avr       Max         X         Y         Z      Axis     About
            RAW         168.183  1387.741
            LSQ          14.814    42.577     2.788    10.394   -37.582     0.814    -0.065
            BLS          14.694    42.582     2.788    10.394   -37.582     0.797    -0.067
Best >>     WLS           1.140     3.112    -0.651     0.840     0.798     0.222    -0.003
            LMedS        29.457    73.773    61.861     0.600    33.393     3.431     0.695
            Tukey         7.942    26.050     3.363    19.819   -15.533     1.736    -0.136
Best >>     TWLS          1.128     2.617    -0.804     1.583     0.862     0.113    -0.008
2ndBest >>  LP            2.932     9.981     0.062    -1.439     0.443     1.040     0.028
            LAD           9.444    27.931    -2.572   -19.194     7.909     3.296     0.286

Noise level: 2 pixels

                              Model Error             Translation Error      Rotation Error
            Estimator       Avr       Max         X         Y         Z      Axis     About
            RAW         322.225  3487.358
            LSQ          42.226   117.004    -2.136    22.557  -109.680     0.999    -0.143
            BLS          41.955   117.310    -2.136    22.557  -109.680     1.054    -0.148
Best >>     WLS           3.110    10.089    -1.587     0.521     1.056     0.887     0.012
            LMedS        66.939   158.797   139.507     2.509   -73.861     7.002     1.385
            Tukey        56.196   170.891   -44.190    38.413  -137.884     4.960    -0.376
Best >>     TWLS          2.930     8.785    -1.981     2.500     1.232     0.604     0.001
2ndBest >>  LP            7.786    18.776     5.453    17.528    -3.897     1.191    -0.032
            LAD          47.899   150.735    18.134    22.658  -120.281     5.341    -0.242

Table 1. Noise resistance test results. See text.
Outliers: 1

                              Model Error             Translation Error      Rotation Error
            Estimator       Avr       Max         X         Y         Z      Axis     About
            RAW          15.898  2024.944
            LSQ          23.180    74.252   -89.455    50.472    -7.990     4.744    -1.016
            BLS          23.181    74.182   -89.455    50.472    -7.990     4.740    -1.015
            WLS          33.218   110.224    39.925   117.020    26.050    13.710     0.304
Best >>     LMedS         0.000     0.000     0.000     0.000     0.000     0.000     0.000
Best >>     Tukey         0.000     0.000     0.000     0.000     0.000     0.000     0.000
            TWLS         45.068    82.139     8.126    82.139    44.938     0.000     0.000
2ndBest >>  LP            0.089     0.246    -0.080     0.128    -0.070     0.031     0.001
            LAD           0.282     0.768    -0.744    -0.234     0.304     0.089    -0.001

Outliers: 10

                              Model Error             Translation Error      Rotation Error
            Estimator       Avr       Max         X         Y         Z      Axis     About
            RAW         226.326  6563.388
            LSQ         160.496   454.014   186.285   -67.371  -353.152     4.125    -1.324
            BLS         161.660   450.713   186.285   -67.371  -353.152     4.025    -1.298
            WLS        3020.721  7003.426  2567.422  -265.045 -3246.323    54.197  -156.784
Best >>     LMedS         0.000     0.000     0.000     0.000     0.000     0.000     0.000
Best >>     Tukey         0.000     0.000     0.000     0.000     0.000     0.000     0.000
            TWLS       1523.436  3173.990   512.954  -883.363 -3173.990     0.000     0.000
            LP            0.173     0.558    -0.463     0.116    -0.274     0.027     0.007
2ndBest >>  LAD           0.002     0.004     0.003    -0.001    -0.001     0.000     0.000

Outliers: 30

                              Model Error             Translation Error      Rotation Error
            Estimator       Avr       Max         X         Y         Z      Axis     About
            LSQ         508.483  1471.398    -8.416   302.877 -1296.576    69.448    -0.940
            BLS         508.633  1460.329    -8.416   302.877 -1296.576    70.357    -0.794
            WLS        3286.981  9146.010 -1052.119   382.434  2836.719    23.806  -173.704
Best >>     LMedS         0.000     0.000     0.000     0.000     0.000     0.000     0.000
Best >>     Tukey         0.000     0.000     0.000     0.000     0.000     0.000     0.000
            TWLS       1209.244  2904.940  -434.435   288.356 -2904.940     0.000     0.000
            LP            0.486     1.165    -0.137     0.111    -1.116     0.024     0.000
2ndBest >>  LAD           0.002     0.005    -0.001     0.001     0.004     0.000     0.000

Outliers: 45

                              Model Error             Translation Error      Rotation Error
            Estimator       Avr       Max         X         Y         Z      Axis     About
            RAW        1014.901  6427.995
            LSQ         705.569  2071.503    97.802   218.575 -1871.529    41.406     0.324
            BLS         713.984  2060.130    97.802   218.575 -1871.529    40.941     0.591
            WLS        3644.573 10070.096 -1219.469  -486.973  1899.952    42.737  -173.704
Best >>     LMedS         0.000     0.000     0.000     0.000     0.000     0.000     0.000
Best >>     Tukey         0.000     0.000     0.000     0.000     0.000     0.000     0.000
            TWLS        840.651  1907.990   -66.158   547.805 -1907.990     0.000     0.000
            LP           10.872    30.298    -0.809     0.573   -29.396     0.100     0.019
2ndBest >>  LAD           0.008     0.025     0.009     0.016    -0.006     0.003     0.000

Table 2. Outliers resistance test results. See text.
Outliers: 45

                              Model Error             Translation Error      Rotation Error
            Estimator       Avr       Max         X         Y         Z      Axis     About
            RAW        1022.459  7436.217
            LSQ         816.918  2743.550   486.938  1516.712 -1475.456    76.481   -16.618
            BLS         817.529  2708.373   486.938  1516.712 -1475.456    76.741   -16.275
            WLS        3859.806 11371.681    41.487  -909.021  1120.821    87.550  -171.161
Best >>     LMedS         0.000     0.000     0.000     0.000     0.000     0.000     0.000
Best >>     Tukey         0.000     0.000     0.000     0.000     0.000     0.000     0.000
            TWLS        692.503  1319.766   -73.745   683.999 -1319.766    90.000     0.000
            LP          189.048   734.462   231.241   674.088  -156.541    78.969    -7.978
            LAD         189.121   732.005   231.881   669.770  -161.131    79.044    -7.882

Table 3. Outliers resistance test results. See text.



4.3 Combined Error Test 

In this test, noise and outlier errors were combined. Table 4 shows the results for the
combined test. We can see that the LP estimator produced the best result up to a level of
30% outliers. (See future plans below.)

4.4 Wide Field of View Test 

The combined error test was repeated, this time for a field of view of ±80°. The results
appear in Table 5. The advantage of the LP estimator increases, and it produced the best
results in all cases.

5 Future Plans - Enhancing the LP Estimator

The current LP estimator is a single iteration estimator that uses all data points with
equal weights. In order to improve robustness, random sampling can be used. In order
to improve accuracy, iterative re-weighting can be used. Since the LP estimator has better
resistance to outliers than LSQ based estimators, it is possible to allow some outliers into
the data set. By doing so Equation 2 becomes:




where s is the number of selected points and r is the maximum number of allowed
outliers. For example, it is quite feasible to select 50 points while reducing the number
of outliers from 30% to 20%. Weights can be selected in a similar manner to the Tukey
M-estimator weights. Fortunately, the standard objective function of a linear program
is already formed with a weight vector: Min: $C^T X$. We just add the vector C to
the linear program. The vector C encodes both weights and random selection (by assigning
Noise: 1/2, Outliers: 10

                              Model Error             Translation Error      Rotation Error
            Estimator       Avr       Max         X         Y         Z      Axis     About
            RAW         330.354  6461.861
            LSQ         211.162   720.278   244.663  -332.714  -465.684    57.474     0.137
            BLS     [illegible]   711.218   244.663  -332.714  -465.684    55.605     0.215
            WLS        3341.308  5777.464  -634.204   275.104 -5077.698    31.155  -172.429
            LMedS        12.949    26.111   -13.468     6.468    14.576     1.925     0.001
2ndBest >>  Tukey         7.904    27.983   -25.376   -14.596    11.418     3.599     0.015
            TWLS        937.255  1455.670   274.821  1080.893 -1455.206     0.087     0.009
Best >>     LP            2.215     7.201    -0.113    -5.457    -2.897     1.057     0.044
2ndBest >>  LAD           8.449    23.724   -12.685   -32.129    -1.661     4.989     0.148

Noise: 1, Outliers: 10

                              Model Error             Translation Error      Rotation Error
            Estimator       Avr       Max         X         Y         Z      Axis     About
            RAW         410.494  6462.574
            LSQ         208.935   713.489   248.853  -323.641  -460.194    56.965     0.160
            BLS         207.474   704.142   248.853  -323.641  -460.194    55.037     0.238
            WLS        3354.821  5823.751  -794.238   464.793 -5064.401    31.989  -169.974
            LMedS        75.463   204.472  -194.590   -67.718   113.134    21.553     0.601
2ndBest >>  Tukey        12.569    35.132   -49.394   -29.450    -2.073     7.187     0.029
            TWLS       1033.217  1709.669   307.170  1081.765 -1708.412     0.254     0.021
Best >>     LP            4.264    14.712    -1.869    -9.889    -6.171     1.996     0.081
            LAD          16.914    48.689   -24.899   -66.783    -4.251    10.224     0.263

Noise: 1/2, Outliers: 20

                              Model Error             Translation Error      Rotation Error
            Estimator       Avr       Max         X         Y         Z      Axis     About
            RAW         534.974  6457.304
            LSQ         317.323   900.667   -60.901    15.641  -868.523     6.686     0.140
            BLS         311.001   899.635   -60.901    15.641  -868.523     6.717     0.041
            WLS        3478.124  5790.135  -191.580  -322.390 -5459.968    34.674  -171.707
            LMedS        27.519    83.162   -64.810   107.649     9.294     8.835    -1.079
2ndBest >>  Tukey         8.830    28.039    11.958     0.238   -21.447     1.666     0.014
            TWLS       1037.784  2158.402   262.833   691.552 -2157.532     0.200    -0.014
Best >>     LP            3.745    10.453    -4.840     0.323    -7.248     0.751    -0.015
            LAD          24.196    77.657    17.441   -13.986   -59.051     3.699     0.178

Noise: 1/2, Outliers: 30

                              Model Error             Translation Error      Rotation Error
            Estimator       Avr       Max         X         Y         Z      Axis     About
            RAW         760.803  6462.767
            LSQ         524.166  1726.181  -578.041  -433.829 -1216.117    70.138    -3.891
            BLS         520.750  1712.118  -578.041  -433.829 -1216.117    68.799    -3.778
            WLS        4472.985  8330.073 -1299.379 -1295.965 -8079.397    33.428  -167.652
2ndBest >>  LMedS        16.647    59.631    54.875     3.462   -29.917     8.572     0.000
Best >>     Tukey        14.425    49.392    -1.603   -65.567   -18.837     8.063     0.376
            TWLS        789.280  1256.991   444.825  1256.446  -664.966     0.248     0.007
            LP           26.023    92.428   -70.492   -39.799   -44.107    10.513     0.202
            LAD          39.923   135.584   -65.520   -31.254   -84.921    11.804     0.179

Table 4. Combined Noise and Outliers resistance test results. See text.
Noise: 1/2, Outliers: 10

                              Model Error             Translation Error      Rotation Error
            Estimator       Avr       Max         X         Y         Z      Axis     About
            RAW         540.999  6085.128
            LSQ         149.257   494.011   144.449  -122.759  -328.790    36.367     0.671
            BLS         149.452   485.989   144.449  -122.759  -328.790    34.432     0.672
            WLS        2311.542  6197.539 -1076.314  1203.634 -1669.289    30.416  -160.070
            LMedS       136.350   308.560    66.910   -31.657  -280.050     5.013    -0.835
            Tukey        43.952   124.268   -58.412    40.524   -78.515     7.686    -0.400
            TWLS       1427.543  2133.874  -728.787  2122.566 -1424.366     1.362     0.078
Best >>     LP            8.570    23.198    -4.450     4.890    -8.686     2.441     0.118
2ndBest >>  LAD          24.561    65.739    -2.359    -3.416   -37.169     4.548    -0.544

Noise: 1, Outliers: 10

                              Model Error             Translation Error      Rotation Error
            Estimator       Avr       Max         X         Y         Z      Axis     About
            RAW        1010.969 26429.873
            LSQ         554.204  2049.524   812.911  -673.969  -877.471    75.363   -12.393
            BLS         548.012  1982.529   812.911  -673.969  -877.471    76.089   -11.444
            WLS        2435.348  6633.286 -1186.128  2156.349 -1824.252    29.462  -158.329
            LMedS       278.094   863.033  -127.341   -50.840  -665.746    43.113     0.046
            Tukey       477.633  1747.181   628.401  -523.574  -808.699    75.444    -8.878
            TWLS       1587.257  2959.155  -216.615  2922.414 -1615.280     5.196     0.056
Best >>     LP           42.467   125.347     3.214    17.726  -104.093     4.168    -0.278
            LAD         149.153   446.592   -13.903     9.976  -329.151    78.027    -2.051

Noise: 1/2, Outliers: 20

                              Model Error             Translation Error      Rotation Error
            Estimator       Avr       Max         X         Y         Z      Axis     About
            RAW         644.872  7978.731
            LSQ         222.049   704.285    38.317   -86.282  -595.499    25.763     0.750
            BLS         218.571   688.169    38.317   -86.282  -595.499    22.022     0.708
            WLS        3290.707  7875.635 -3935.027 -2673.111 -2663.154    28.169  -168.929
            LMedS       187.139   469.149    58.104    24.594  -410.256    12.638    -0.308
            Tukey      1295.181  2464.983   949.616 -1625.748  1940.417    45.014    -4.116
            TWLS       2602.932  5943.315   958.470  5937.370  -909.445     0.749     0.026
Best >>     LP           31.054   111.478   -45.373    27.961   -50.135    12.499    -0.273
2ndBest >>  LAD          82.355   243.906   -14.799    21.514  -219.603     6.353     0.147

Noise: 1/2, Outliers: 30

                              Model Error             Translation Error      Rotation Error
            Estimator       Avr       Max         X         Y         Z      Axis     About
            RAW         749.124  4368.621
            LSQ         311.688   998.178  -350.897  -116.254  -695.196    50.792    -2.239
            BLS         310.007   992.931  -350.897  -116.254  -695.196    50.034    -2.296
            WLS        2751.972  7107.815 -3003.917    -1.453 -2270.162    35.341  -171.986
            LMedS       218.722   778.933  -325.832   440.850  -114.510    60.658    -8.958
            Tukey       110.744   342.292   -97.494   138.649   228.195    16.395    -1.663
            TWLS       1964.493  4265.691   448.350  4255.876 -1184.232     0.971     0.077
Best >>     LP           62.339   216.881  -114.129    26.686   -94.911    25.224    -1.053
2ndBest >>  LAD          86.095   280.483   -45.449     8.512  -217.496    13.088    -0.300

Table 5. Wide FOV test results. See text.
zero weight to non-selected points). (Most linear program solvers optimize away constraints
with zero effect on the objective function, so the efficiency is not affected by zero weight
values.) Note that this section also applies to the LAD estimator.
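
The two ideas of this section can be sketched as follows. The code is an illustration
under stated assumptions, not the planned implementation. First, `build_cost_vector`
(a hypothetical helper) fills the cost vector C for the slack variables only, assuming six
slack variables per point, with a zero entry for every non-selected point. Second,
`success_probability` assumes that each of the s selected points is independently an
outlier with probability q, so that a sample is acceptable with the binomial probability
of containing at most r outliers; this binomial form is an assumption, and with r = 0 it
reduces to Eq. 2 with k = s.

```python
import math


def build_cost_vector(weights, selected):
    """Cost entries for the slack variables: 6 per point (r+ and r- for x, y, z).

    weights  : one non-negative weight per data point
    selected : set of indices of the randomly selected points
    """
    cost = []
    for i, w in enumerate(weights):
        coeff = w if i in selected else 0.0   # zero weight drops the point
        cost.extend([coeff] * 6)
    return cost


def sample_ok_probability(q, s, r):
    """Probability that a sample of s points contains at most r outliers."""
    return sum(math.comb(s, i) * q ** i * (1.0 - q) ** (s - i) for i in range(r + 1))


def success_probability(q, s, r, n):
    """Probability that at least one of n samples contains at most r outliers."""
    return 1.0 - (1.0 - sample_ok_probability(q, s, r)) ** n


cost = build_cost_vector([0.5, 1.0, 2.0], selected={0, 2})
```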

6 Conclusions 

1. As expected, the maximum likelihood estimators (such as WLS) produced the best
estimations in the presence of noise (zero mean noise, not necessarily Gaussian). The
robust estimators (such as LMedS) produced the best estimations in the presence of
outliers without any noise. The LP estimation was better than the maximum likelihood
estimations in the presence of outliers, better than the robust estimations
in the presence of noise, and better than both in the presence of noise and outliers,
especially for a wide field of view.

2. The LP estimator was tested with the maximum possible outlier error (in magnitude)
within the frame of the image. No leverage point cases were observed. This result
complies with previous results that appeared in [*].

3. The current LP estimator, which is single iteration, uses uniform weights and includes
all data points, can be enhanced to exploit random sampling and iterative re-weighting
within the standard framework of linear programming.
Abstract. A new approach to characterizing the performance of point-correspondence
algorithms is presented. Instead of relying on any “ground truth”, it uses the
self-consistency of the outputs of an algorithm independently applied to different
sets of views of a static scene. It allows one to evaluate algorithms for a given
class of scenes, as well as to estimate the accuracy of every element of the output
of the algorithm for a given set of views. Experiments to demonstrate the usefulness
of the methodology are presented.



1 Introduction and Motivation 

Our visual system has a truly remarkable property: given a static natural scene, 
the perceptual inferences it makes from one viewpoint are almost always con- 
sistent with the inferences it makes from a different viewpoint. We call this 
property self-consistency. 

The ultimate goal of our research is to be able to design computer vision algorithms
that are also self-consistent. The first step towards achieving this goal
is to measure the self-consistency of the inferences of current computer vision
algorithms over many scenes. An important refinement of this is to measure the
self-consistency of subsets of an algorithm’s inferences, subsets that satisfy
certain measurable criteria, such as having a “high confidence.”

Once we can measure the self-consistency of an algorithm, and we observe 
that this measure remains reasonably constant over many scenes (at least for 
certain subsets), then we can be reasonably confident that the algorithm will 
be self-consistent over new scenes. More importantly, such algorithms are also 
likely to exhibit the self-consistency property of the human visual system: given a 
single view of a new scene, such an algorithm is likely to produce inferences that 
would be self-consistent with other views of the scene should they become available
later. Thus, measuring self-consistency is a critical step towards discovering (and
eventually designing) self-consistent algorithms. It could also be used to learn
the parameters of an algorithm that lead to self-consistency.

* This work was sponsored in part by the Defense Advanced Research Projects Agency
under contract F33615-97-C-1023 monitored by Wright Laboratory. The views and
conclusions contained in this document are those of the authors and should not be
interpreted as representing the official policies, either expressed or implied, of the
Defense Advanced Research Projects Agency, the United States Government, or SRI
International.

There are a number of caveats that one needs to make with regards to self- 
consistency. 

First, self-consistency is a necessary, but not sufficient, condition for a com- 
puter vision algorithm to be correct. That is, it is possible (in principle) for 
a computer vision algorithm to be self-consistent over many scenes but be se- 
verely biased or entirely wrong. We conjecture that this cannot be the case for 
non-trivial algorithms. If bias can be ruled out, then the self-consistency distri- 
bution becomes a measure of the accuracy of an algorithm — one which requires 
no “ground truth.” 

Second, self-consistency must be measured over a wide variety of scenes to 
be a useful predictor of self-consistency over new scenes. In practice, one can 
measure self-consistency over certain classes of scenes, such as close-up views of 
faces, or aerial images of natural terrain. 

In the remainder of this paper we develop a particular formalization of self- 
consistency and an instantiation of this formalism for the case of stereo (or, in 
general, multi-image point-correspondence) algorithms. We then present
measurements of the self-consistency of some stereo algorithms applied to a variety
of real images to demonstrate the utility of these measurements, and compare this
approach to previous work in estimating uncertainty.



2 A Formalization of Self-Consistency 

We begin with a simple formalization of a computer vision algorithm as a
function that takes an observation $\Omega$ of a world $W$ as input and produces a set of
hypotheses $H$ about the world as output:

$H = (h_1, h_2, \ldots, h_n) = F(\Omega, W).$

An observation $\Omega$ is one or more images of the world taken at the same time,
perhaps accompanied by meta-data, such as the time the image(s) was acquired,
the internal and external camera parameters, and their covariances.

A hypothesis $h$ nominally refers to some aspect or element of the world (as
opposed to some aspect of the observation), and it nominally estimates some
attribute of the element it refers to. We formalize this with the following set of
functions that depend on both $F$ and $\Omega$:

1. Ref(h), the referent of the hypothesis h (i.e., the element in the world
that the hypothesis refers to).

2. R(h, h') = Prob(Ref(h) = Ref(h')), an estimate of the probability that two
hypotheses h and h' (computed from two observations of the same world)
refer to the same object or process in the world.

3. Att(h), an estimate of some well-defined attribute of the referent.

4. Acc(h), an estimate of the accuracy distribution of Att(h). When this is
well-modeled by a normal distribution, it can be represented implicitly by
its covariance, Cov(h).

5. Score(h), an estimate of the confidence that Att(h) is correct.

Intuitively, we can state that two hypotheses $h$ and $h'$, derived from
observations $\Omega$ and $\Omega'$ of a static world $W$, are consistent with each other if they
both refer to the same object in the world and the difference in their estimated
attributes is small relative to their accuracies, or if they do not refer to the
same object. When the accuracy is well modeled by a normal distribution, the
consistency of two hypotheses, $C(h, h')$, can be written as

$C(h, h') = R(h, h') \, (\mathrm{Att}(h) - \mathrm{Att}(h'))^T (\mathrm{Cov}(h) + \mathrm{Cov}(h'))^{-1} (\mathrm{Att}(h) - \mathrm{Att}(h'))$

Note that the second term on the right is the Mahalanobis distance between 
the attributes, which we refer to as the normalized distance between attributes 
throughout this paper. 

Given the above, we can measure the self-consistency of an algorithm as
the histogram of $C(h, h')$ over all pairs of hypotheses in $H = F(\Omega(W))$ and
$H' = F(\Omega'(W))$, over all observations over all suitable static worlds $W$. We call
this distribution of $C(h, h')$ the self-consistency distribution of the computer
vision algorithm $F$ over the worlds $W$. To simplify the exposition below, we
compute this distribution only for pairs $h$ and $h'$ for which $R(h, h') \approx 1$. We will
discuss the utility of the full distribution in future work.
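
For the case R(h, h') close to 1 with normally distributed accuracy, the consistency measure reduces to a quadratic form, which can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the `Hypothesis` container, the `solve` helper, and the `consistency` function are hypothetical names, and attributes are assumed to be short vectors such as 3D points.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]
Matrix = List[List[float]]


@dataclass
class Hypothesis:
    att: Vector  # Att(h): estimated attribute, e.g. a 3D point
    cov: Matrix  # Cov(h): covariance of that estimate


def solve(a: Matrix, b: Vector) -> Vector:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def consistency(h: Hypothesis, h2: Hypothesis, same_referent: float = 1.0) -> float:
    """C(h, h') = R(h, h') * d^T (Cov(h) + Cov(h'))^-1 d, with d = Att(h) - Att(h')."""
    d = [p - q for p, q in zip(h.att, h2.att)]
    s = [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(h.cov, h2.cov)]
    y = solve(s, d)  # y = S^-1 d, without forming the inverse explicitly
    return same_referent * sum(di * yi for di, yi in zip(d, y))
```

Solving the linear system rather than inverting the summed covariance is a design choice made here for numerical convenience; it gives the same value as the formula.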



3 Self-Consistency of Stereo Algorithms 

We can apply the above abstract self-consistency formalism to stereo algorithms 
([uni)- For the purposes of this paper, we assume that the projection matrices 
and associated covariances are known for all images. 

The hypothesis $h$ produced by a traditional stereo algorithm is a pair of
image coordinates $(x_0, x_1)$ in each of two images, $(I_0, I_1)$. In its simplest form, a
stereo match hypothesis $h$ asserts that the closest opaque surface element along
the optic ray through $x_0$ is the same as the closest opaque surface element along
the optic ray through $x_1$. That is, the referent of $h$, Ref(h), is the closest opaque
surface element along the optic rays through both $x_0$ and $x_1$.

Consequently, two stereo hypotheses have the same referent if their image
coordinates are the same in one image. In other words, if we have a match in
image pair $(I_0, I_1)$ and a match in image pair $(I_1, I_2)$, then the stereo algorithm
is asserting that they refer to the same opaque surface element when the
coordinates of the matches in image $I_1$ are the same. Self-consistency, in this case,
is a measure of how often (and to what extent) this assertion is true.

The above observation can be used to write the following set of associated
functions for a stereo algorithm. We assume that all matches are accurate to
within some nominal accuracy, $\sigma$, in pixels (typically $\sigma = 1$). This can be extended
to include the full covariance of the match coordinates.

Fig. 1. Results on two different types of images: terrain (top) vs. tree canopy (bottom). Panels: (a) sample image, (b) self-consistency distributions, (c) scatter diagrams with MDL score.



1. Ref(h), the closest opaque surface element visible along the optic rays
through the match points.

2. R(h, h') = 1 if h and h' have the same coordinate (within $\sigma$) in one image;
0 otherwise.

3. Att(h), the triangulated 3D (or projective) coordinates of the surface element.

4. Acc(h), the covariance of Att(h), given that the match coordinates are
$N(x_0, \sigma)$ and $N(x_1, \sigma)$ random variables.

5. Score(h), a measure such as normalized cross-correlation or sum of squared
differences.

Without taking into account Score(h), the self-consistency distribution is
the histogram of normalized differences in triangulated 3D points for pairs of
matches with a common point in one image (Sec. 4.2). Items 3 and 4 above will
be expanded upon in Sec. 5. One way, further described in Sec. 4.3, to take into
account Score(h) is to plot a scatter diagram using as x-axis Score(h), and as
y-axis the normalized differences in triangulated 3D points.

4 The Self-Consistency Distribution 

4.1 A Methodology for Estimating the Self-Consistency 
Distribution 

Ideally, the self-consistency distribution should be computed using all possible 
variations of viewpoint and camera parameters (within some class of variations) 
over all possible scenes (within some class of scenes). However, we can compute 
an estimate of the distribution using some small number of images of a scene, 
and average this distribution over many scenes. 




286 Y.G. Leclerc, Q.-T. Luong, and P. Fua 



Here we start with some fixed collection of images assumed to have been 
taken at exactly the same time (or, equivalently, a collection of images of a static 
scene taken over time). Each image has a unique index and associated projection 
matrix and (optionally) projection covariances. We then apply a stereo algorithm 
independently to all pairs of images in this collection.^1 Each such pair of images
is an observation in our formalism. The image indices, match coordinates, and
score are reported in match files for each image pair.

We now search the match files for pairs of matches that have the same co- 
ordinate in one image. For example, if a match is derived from images 1 and 
2, another match is derived from images 1 and 3, and these two matches have 
the same coordinate in image 1, then these two matches have the same referent. 
Such a pair of matches, which we call a common-point match set, should be 
self-consistent because they should correspond to the same point in the world. 
This extends the principle of the trinocular stereo constraint to arbitrary
camera configurations and multiple images.
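
The search for pairs of matches sharing a coordinate in one image can be sketched as follows. This is only an illustration: the `Match` record layout and the `common_point_pairs` function are hypothetical, not the match-file format used by the authors. Quantizing coordinates into cells of about one nominal accuracy in size is an assumption that only approximates "same coordinate", since two nearby points can straddle a cell boundary.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, NamedTuple, Tuple

Point = Tuple[float, float]


class Match(NamedTuple):
    images: Tuple[int, int]      # indices of the two images of the pair
    points: Tuple[Point, Point]  # one (x, y) coordinate per image
    score: float


def common_point_pairs(matches: List[Match], sigma: float = 1.0) -> List[Tuple[int, int]]:
    """Return index pairs of matches that share a coordinate in one image."""
    buckets: Dict[Tuple[int, int, int], List[int]] = defaultdict(list)
    for idx, m in enumerate(matches):
        for image, (x, y) in zip(m.images, m.points):
            # Points within roughly sigma pixels fall into the same cell.
            key = (image, round(x / sigma), round(y / sigma))
            buckets[key].append(idx)
    pairs = set()
    for members in buckets.values():
        for i, j in combinations(members, 2):
            # The two matches must come from different image pairs.
            if matches[i].images != matches[j].images:
                pairs.add((i, j))
    return sorted(pairs)
```

For example, a match from images 1 and 2 and a match from images 1 and 3 that agree on their image-1 coordinate end up in the same bucket and are returned as one pair.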

Given two matches in a common-point match set, we can now compute the 
distance between their triangulations, after normalizing for the camera
configurations (see Sec. 5). The histogram of these normalized differences, computed
over all common-point matches, is our estimate of the self-consistency distribu- 
tion. 

Another distribution that one could compute using the same data files would 
involve using all the matches in a common-point match set, rather than just 
pairs of matches. For example, one might use the deviation of the triangulations 
from the mean of all triangulations within a set. This is problematic for several 
reasons. 

First, there are often outliers within a set, making the mean triangulation 
less than useful. One might mitigate this by using a robust estimation of the 
mean. But this depends on various (more or less) arbitrary parameters of the 
robust estimator that could change the overall distribution. 

Second, and perhaps more importantly, we see no way to extend the
normalization used to eliminate the dependence on camera configurations, described
in Sec. 5, to the case of multiple matches.

Third, we see no way of using the above variants of the self-consistency 
distribution for change detection. 



4.2 An Example of the Self-Consistency Distribution 

To illustrate the self-consistency distribution, we first apply the above methodo- 
logy to the output of a simple stereo algorithm jZj. The algorithm first rectifies 
the input pair of images and then searches for 7x7 windows along scan lines that 
maximize a normalized cross-correlation metric. Sub-pixel accuracy is achieved 
by fitting a quadratic to the metric evaluated at the pixel and its two adjacent
neighbors. The algorithm computes matches first by comparing the left image
against the right and then by comparing the right image against the left. Matches
that are not consistent between the two searches are eliminated. Note that this
is a way of using self-consistency as a filter.

^1 Note that the “stereo” algorithm can find matches in n > 2 images. In this case, the
algorithm would be applied to all subsets of size n. We use n = 2 to simplify the
presentation here.
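
Two steps of this simple stereo algorithm, the quadratic sub-pixel refinement and the left/right check, can be sketched as follows. This is a minimal illustration on a single rectified scan line, not the authors' implementation: the function names, the dictionary representation of matches, and the one-pixel tolerance `tol` are assumptions made here.

```python
from typing import Dict


def subpixel_offset(prev: float, peak: float, nxt: float) -> float:
    """Vertex of the parabola through (-1, prev), (0, peak), (1, nxt)."""
    denom = prev - 2.0 * peak + nxt
    if denom == 0.0:
        return 0.0  # flat neighbourhood: keep the integer position
    return 0.5 * (prev - nxt) / denom


def left_right_consistent(
    left_to_right: Dict[int, float],
    right_to_left: Dict[int, float],
    tol: float = 1.0,
) -> Dict[int, float]:
    """Keep matches x -> x' whose reverse match from x' returns within tol of x."""
    kept = {}
    for x, x_right in left_to_right.items():
        back = right_to_left.get(round(x_right))
        if back is not None and abs(back - x) <= tol:
            kept[x] = x_right
    return kept
```

The offset returned by `subpixel_offset` is added to the integer position of the correlation peak; it lies between -0.5 and 0.5 whenever the centre sample is a strict maximum.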

The stereo algorithm was applied to all pairs of five aerial images of bare 
terrain, one of which is illustrated in the top row of Figure 1(a). These images 
are actually small windows from much larger images (about 9000 pixels on a 
side) for which precise ground control and bundle adjustment were applied to 
get accurate camera parameters. 

Because the scene consists of bare, relatively smooth, terrain with little ve- 
getation, we would expect the stereo algorithm described above to perform well. 
This expectation is confirmed anecdotally by visually inspecting the matches. 

However, we can get a quantitative estimate for the accuracy of the algorithm 
for this scene by computing the self-consistency distribution of the output of the 
algorithm applied to the ten image pairs in this collection. Figure 1(b) shows 
two versions of the distribution. The solid curve is the probability density (the 
probability that the normalized distance equals x). It is useful for seeing the mode 
and the general shape of the distribution. The dashed curve is the cumulative 
probability distribution (the probability that the normalized distance is less than 
x). It is useful for seeing the median of the distribution (the point where the curve 
reaches 0.5) or the fraction of match pairs with normalized distances exceeding 
some value. 
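
The two curves and the summary values read from them can be computed from a list of normalized distances as in the sketch below. The function names, the bin width, and the upper limit of the histogram are illustrative assumptions, not values taken from the paper; distances are assumed to be non-negative.

```python
from statistics import median
from typing import List, Sequence, Tuple


def density_and_cumulative(
    distances: Sequence[float], bin_width: float = 0.25, max_value: float = 10.0
) -> Tuple[List[float], List[float]]:
    """Histogram estimates of the density and cumulative distribution.

    Distances above max_value count towards the total but fall outside the
    bins, so the cumulative curve ends below 1 when such outliers exist.
    """
    n_bins = int(round(max_value / bin_width))
    counts = [0] * n_bins
    for d in distances:
        k = int(d / bin_width)
        if 0 <= k < n_bins:
            counts[k] += 1
    total = len(distances)
    density = [c / (total * bin_width) for c in counts]
    cumulative, running = [], 0
    for c in counts:
        running += c
        cumulative.append(running / total)
    return density, cumulative


def fraction_above(distances: Sequence[float], threshold: float) -> float:
    """Fraction of match pairs whose normalized distance exceeds threshold."""
    return sum(d > threshold for d in distances) / len(distances)


def median_distance(distances: Sequence[float]) -> float:
    """Point at which the cumulative distribution reaches 0.5."""
    return median(distances)
```

Keeping the outliers in the total, rather than discarding them, is what lets the cumulative curve show the fraction of match pairs beyond the plotted range.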

In this example, the self-consistency distribution shows that the mode is 
about 0.5, about 95% of the normalized distances are below 1, and that about 
2% of the match pairs have normalized distances above 10. 

In the bottom row of Figure 1 we see the self-consistency distribution for
the same algorithm applied to all pairs of five aerial images of a tree canopy.
Such scenes are notoriously difficult for stereo algorithms. Visual inspection of
the output of the stereo algorithm confirms that most matches are quite wrong.
This can be quantified using the self-consistency distribution in Figure 1(b).
Here we see that, although the mode of the distribution is still about 0.5, only 
10% of the matches have a normalized distance less than 1, and only 42% of the 
matches have a normalized distance less than 10. 

Note that the distributions illustrated above are not well modelled using 
Gaussian distributions because of the predominance of outliers (especially in the 
tree canopy example). This is why we have chosen to compute the full distribu- 
tion rather than use its variance as a summary. 

4.3 Conditionalization 

As mentioned in the introduction, the global self-consistency distribution, while 
useful, is only a weak estimate of the accuracy of the algorithm. This is clear 
from the above examples, in which the unconditional self-consistency distribution 
varied considerably from one scene to the next. 

However, we can compute the self-consistency distribution for matches having 
a given “score” (such as the MDL-based score described in detail in Appendix A).
This is illustrated in Figure 1(c) using a scatter diagram. The scatter diagram
shows a point for every pair of matches, the x coordinate of the point being 
the larger of the scores of the two matches, and the y coordinate being the 
normalized distance between the matches. 

There are several points to note about the scatter diagrams. First, the terrain 
example (top row) shows that most points with scores below 0 have normalized 
distances less than about 1. Second, most of the points in the tree canopy ex- 
ample (bottom row) are not self-consistent. Third, none of the points in the tree 
canopy example have scores below 0. Thus, it would seem that this score is able 
to segregate self-consistent matches from non-self-consistent matches, even when 
the scenes are radically different (see Sec. 6.3).
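
The conditionalization on score can be sketched as follows: each pair of matches contributes one scatter point, and the points are then grouped by score to see how self-consistency varies with it. The function names, the unit bin width, and the use of "normalized distance below 1" as the summary are illustrative assumptions, not the authors' procedure.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def scatter_point(score_a: float, score_b: float, normalized_distance: float) -> Tuple[float, float]:
    """One point per pair of matches: x is the larger of the two scores."""
    return max(score_a, score_b), normalized_distance


def consistent_fraction_by_score(
    points: Iterable[Tuple[float, float]], bin_width: float = 1.0, threshold: float = 1.0
) -> Dict[float, float]:
    """Per score bin, the fraction of pairs with normalized distance below threshold."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for score, distance in points:
        left_edge = math.floor(score / bin_width) * bin_width
        totals[left_edge] += 1
        hits[left_edge] += distance < threshold
    return {edge: hits[edge] / totals[edge] for edge in sorted(totals)}
```

Each key of the returned dictionary is the left edge of a score bin, so a score that segregates matches well shows a fraction near 1 in its low-score bins.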

5 Projection Normalization 

To apply the self-consistency method to a set of images, all we need is the set 
of projection matrices in a common projective coordinate system. This can be 
obtained from point correspondences using projective bundle adjustment it™ 
and does not require camera calibration. The Euclidean distance is not invariant 
to the choice of projective coordinates, but this dependence can often be reduced 
by using the normalization described below. Another way to do so, which actually 
cancels the dependence on the choice of projective coordinates, is to compute the 
difference between the reprojections instead of the triangulations, as described 
in more detail in ED. This, however, does not cancel the dependence on the 
relative geometry of the cameras. 

5.1 The Mahalanobis Distance 

Assuming that the contribution of each individual match to the statistics is the 
same ignores many imaging factors like the geometric configuration of the came- 
ras and their resolution, or the distance of the 3D point from the cameras. There 
is a simple way to take all of these factors into account: applying a
normalization that makes the statistics invariant to these imaging factors. In addition,
this mechanism makes it possible to take into account the uncertainty in camera 
parameters, by including them into the observation parameters. 

We assume that the observation error (due to image noise and digitization
effects) is Gaussian. This makes it possible to compute the covariance of the
reconstruction given the covariance of the observations. Let us consider two
reconstructed estimates of a 3-D point, $M_1$ and $M_2$, to be compared, and their
computed covariance matrices $\Lambda_1$ and $\Lambda_2$. We weight the squared Euclidean
distance between $M_1$ and $M_2$ by the inverse of the sum of their covariances. This
yields the squared Mahalanobis distance: $(M_1 - M_2)^T (\Lambda_1 + \Lambda_2)^{-1} (M_1 - M_2)$.

5.2 Determining the Reconstruction and Reprojection Covariances 

If the measurements are modeled by the random vector $x$, of mean $x_0$ and of
covariance $\Lambda_x$, then the vector $y = f(x)$ is a random vector of mean $f(x_0)$
and, up to the first order, covariance $J_f(x_0) \Lambda_x J_f(x_0)^T$, where $J_f(x_0)$ is the
Jacobian matrix of $f$ at the point $x_0$.
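
The first-order propagation rule can be illustrated with a short sketch. It is not the authors' code: the Jacobian is approximated here by central differences, whereas the paper derives it from a closed-form expression, and the function names and step size are assumptions.

```python
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def numerical_jacobian(f: Callable[[Vector], Vector], x0: Vector, step: float = 1e-6) -> Matrix:
    """Central-difference Jacobian of f at x0; rows are outputs, columns inputs."""
    columns = []
    for k in range(len(x0)):
        plus, minus = list(x0), list(x0)
        plus[k] += step
        minus[k] -= step
        f_plus, f_minus = f(plus), f(minus)
        columns.append([(p - m) / (2.0 * step) for p, m in zip(f_plus, f_minus)])
    return [list(row) for row in zip(*columns)]


def propagate_covariance(jac: Matrix, cov_x: Matrix) -> Matrix:
    """First-order covariance of y = f(x): J * cov_x * J^T."""
    n_out, n_in = len(jac), len(cov_x)
    j_cov = [
        [sum(jac[i][k] * cov_x[k][j] for k in range(n_in)) for j in range(n_in)]
        for i in range(n_out)
    ]
    return [
        [sum(j_cov[i][k] * jac[j][k] for k in range(n_in)) for j in range(n_out)]
        for i in range(n_out)
    ]
```

For a linear map the result is exact, which gives a simple way to check the code: with f(x) = (2 x_0 + x_1, x_0 - x_1) and input covariance diag(1, 4), the propagated covariance is [[8, -2], [-2, 5]].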

In order to determine the 3-D error distribution of the reconstruction, the vector
$x$ is defined by concatenating the 2-D coordinates of each point of the match,
i.e. $[x_1, y_1, x_2, y_2, \ldots, x_n, y_n]$, and the result of the function is the 3-D
coordinates $X, Y, Z$ of the point $M$ reconstructed from the match, in the least-squares
sense. The key is that $M$ is expressed by a closed-form formula of the form
$M = (L^T L)^{-1} L^T b$, where $L$ and $b$ are a matrix and vector which depend on
the projection matrices and coordinates of the points in the match. This makes
it possible to obtain the derivatives of $M$ with respect to the $2n$ measurements
$w_i$, $i = 1 \ldots n$, $w = x, y$. We also assume that the errors at each pixel are
independent, uniform, and isotropic. The covariance matrix $\Lambda_x$ is then diagonal,
therefore each element of $\Lambda_M$ can be computed as a sum of independent terms
for each image.

The above calculations are exact when the mapping between the vector of
coordinates of $m_i$ and $M$ (resp. $m'_j$ and $M'$) is linear, since it is only in that case
that the distribution of $M$ and $M'$ is Gaussian. The reconstruction operation is
exactly linear only when the projection matrices are affine. However, the linear 
approximation is expected to remain reasonable under normal viewing conditi- 
ons, and to break down only when the projection matrices are in configurations 
with strong perspective. 




Fig. 2. Un-normalized (top) vs normalized (bottom) self-consistency distributions. Panels: random general projections, perturbed projections, random affine projections.

6 Experiments

6.1 Synthetic Data 

In order to gain insight into the nature of the normalized self-consistency distri- 
butions, we investigate the case when the noise in point localization is Gaussian. 

We first derive the analytical model for the self-consistency distribution in
that case. We then show, using Monte Carlo experiments, that, provided that
the geometrical normalization described in Sec. 5 is used, the experimental
self-consistency distributions fit this model quite well when perspective effects are
not strong. A consequence of this result is that under the hypothesis that the
localization error of the features in the images is Gaussian, the self-consistency
distribution could be used to recover exactly the accuracy distribution.

Modeling the Gaussian self-consistency distributions. The squared Mahalanobis
distance in 3D follows a chi-square distribution with three degrees of freedom:

$\chi_3^2(x) = \frac{\sqrt{x}}{\sqrt{2\pi}} e^{-x/2}$

In our model, the Mahalanobis distance is computed between $M$, $M'$,
reconstructions in 3D, which are obtained from matches $m_i$, $m'_j$ whose coordinates
are assumed to be Gaussian, zero-mean and with standard deviation $\sigma$. If $M$,
$M'$ are obtained from the coordinates $m_i$, $m'_j$ with a linear transformation $A$,
$A'$, then the covariances are $\sigma^2 A A^T$, $\sigma^2 A' A'^T$. The Mahalanobis distance follows
the distribution:

$d_3(x) = \sqrt{\frac{2}{\pi}} \, \frac{x^2}{\sigma^3} \, e^{-x^2 / (2\sigma^2)}$ (1)

Using the Mahalanobis distance, the self-consistency distributions should be 
statistically independent of the 3D points and projection matrices. Of course, if 
we were just using the Euclidean distance, there would be no reason to expect 
such an independence. 
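
A quick Monte Carlo check of the Gaussian model can be sketched as follows. It rests on an assumption made here rather than on a formula quoted from the authors: if the normalization is computed with the nominal accuracy of one pixel while the actual noise has standard deviation sigma, then the normalized distance is sigma times a chi variate with three degrees of freedom, whose density is sqrt(2/pi) x^2 / sigma^3 exp(-x^2 / (2 sigma^2)) and whose mean is 2 sigma sqrt(2/pi). The function names are hypothetical.

```python
import math
import random
from typing import List


def scaled_chi3_density(x: float, sigma: float) -> float:
    """Density of sigma times a chi variate with three degrees of freedom."""
    return (
        math.sqrt(2.0 / math.pi)
        * x * x / sigma ** 3
        * math.exp(-x * x / (2.0 * sigma * sigma))
    )


def simulate_normalized_distances(sigma: float, n: int, seed: int = 0) -> List[float]:
    """Draw n normalized distances under the linear-Gaussian model.

    After whitening by the nominal (unit-sigma) covariance, the difference of
    the two reconstructions is assumed to be an isotropic 3D Gaussian with
    standard deviation sigma; its norm is the normalized distance.
    """
    rng = random.Random(seed)
    return [
        math.sqrt(sum(rng.gauss(0.0, sigma) ** 2 for _ in range(3)))
        for _ in range(n)
    ]
```

Comparing a histogram of the simulated distances with `scaled_chi3_density` for several values of sigma mimics, in the idealized linear case, the kind of comparison between experimental and theoretical curves described in this section.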

Comparison of the normalized and unnormalized distributions. To explore the 
domain of validity of the first-order approximation to the covariance, we have 
considered three methods to generate random projection matrices: 

1. General projection matrices are picked randomly. 

2. Projection matrices are obtained by perturbing a fixed, realistic matrix 
(which is close to affine). Entries of this matrix are each varied randomly 
within 500% of the initial value. 

3. Affine projection matrices are picked randomly. 

Each experiment in a set consisted of picking random 3D points, random 
projection matrices according to the configuration previously described, projec- 
ting them, adding random Gaussian noise to the matches, and computing the 
self-consistency distributions by labelling the matches so that they are perfect. 
To illustrate the invariance of the distribution that we can obtain using the
normalization, we performed experiments where we computed both the normalized
version and the unnormalized version of the self-consistency. As can be
seen in Fig. 2, using the normalization dramatically reduced the spread of the
self-consistency curves found within each experiment in a set. In particular, in
the last two configurations, the resulting spread was very small, which indicates
that the geometrical normalization was successful at achieving invariance with 
respect to 3D points and projection matrices. 

Comparison of the experimental and theoretical distributions. Using the
Mahalanobis distance, we then averaged the density curves within each set of
experiments, and tried to fit the model described in Eq. (1) to the resulting curves, for
six different values of the standard deviation, $\sigma$ = 0.5, 1, 1.5, 2, 2.5, 3. As
illustrated in Fig. 3, the model describes the average self-consistency curves very well
when the projection matrices are affine (as expected from the theory), but also 
when they are obtained by perturbation of a fixed matrix. When the projection 
matrices are picked totally at random, the model does not describe the curves 
very well, but the different self-consistency curves corresponding to each noise 
level are still distinguishable. 

6.2 Comparing Two Algorithms 

The experiments described here and in the following section are based on the 
application of stereo algorithms to seventeen scenes, each comprising five images, 
for a total of 85 images and 170 image pairs. At the highest resolution, each image 
is a window of about 900 pixels on a side from images of about 9000 pixels on 
a side. Some of the experiments were done on Gaussian-reduced versions of the 
images. These images were controlled and bundle-adjusted to provide accurate 
camera parameters. 

A single self-consistency distribution for each algorithm was created by mer- 
ging the scatter data for that algorithm across all seventeen scenes. In previous 
papers, cmi, we compared two algorithms, but using data from only four 
images. By merging the scatter data as we do here, we are now able to compare 
algorithms using data from many scenes. This results in a much more compre- 
hensive comparison. 

The merged distributions are shown in Figure 4 as probability density
functions for the two algorithms. The solid curve represents the distribution for our
deformable mesh algorithm |B|, and the dashed curve represents the distribution 
for the stereo algorithm described above. 

Comparing these two graphs shows some interesting differences between the 
two algorithms. The deformable mesh algorithm clearly has more outliers (mat- 
ches with normalized distances above 1), but has a much greater proportion of 
matches with distances below 0.25. This is not unexpected since the strength of
the deformable mesh algorithm is its ability to do very precise matching between
images. However, the algorithm can get stuck in local minima. Self-consistency now
allows us to quantify how often this happens. 
But this comparison also illustrates that one must be very careful when 
comparing algorithms or assessing the accuracy of a given algorithm. The dis- 
tributions we get are very much dependent on the scenes being used (as would 
also be the case if we were comparing the algorithms against ground truth — the 
“gold standard” for assessing the accuracy of a stereo algorithm). In general, 
the distributions will be most useful if they are derived from a well-defined class 
of scenes. It might also be necessary to restrict the imaging conditions (such as 
resolution or lighting) as well, depending on the algorithm. Only then can the 
distribution be used to predict the accuracy of the algorithm when applied to 
images of similar scenes. 

6.3 Comparing Three Scoring Functions 

To eliminate the dependency on scene content, we propose to use a score
associated with each match. We saw scatter diagrams in Figure 1(c) that illustrated
how a scoring function might be used to segregate matches according to their 
expected self-consistency. 

In this section we will compare three scoring functions: one based on
Minimum Description Length Theory (the MDL score, Appendix A), the traditional
sum-of-squared-differences (SSD) score, and the SSD score normalized by the
localization covariance (SSD/GRAD score) |S|. All scores were computed using
the same matches computed by our deformable mesh algorithm applied to all
image pairs of the seventeen scenes mentioned above. The scatter diagrams for
all of the areas were then merged together to produce the scatter diagrams shown
in Figure 5.

The MDL score has the very nice property that the confidence interval (as 
defined earlier) rises monotonically with the score, at least until there is a paucity 
of data, when the score is greater than 2. It also has a broad range of scores 
(those below zero) for which the normalized distances are below 1, with far fewer 
outliers than the other scores. 

The SSD/GRAD score also increases monotonically (with perhaps a shallow 
dip for small values of the score), but only over a small range. 

The traditional SSD score, on the other hand, is distinctly not monotonic. It 
is fairly non-self-consistent for small scores, then becomes more self-consistent, 
and then rises again. 

6.4 Comparing Window Size 

One of the common parameters in a traditional stereo algorithm is the window 
size. Figure 6 presents one image from each of six urban scenes, where each scene
comprised four images. Figure 7 shows the merged scatter diagrams (a) and global
self-consistency distributions (b) for all six scenes, for three window sizes (7 x 7,
15 x 15, and 29 x 29). Some of the observations to note from these experiments
are as follows. 

First, note that the scatter diagram for the 7x7 window of this class of scenes 
has many more outliers for scores below -1 than were found in the scatter diagram 
for the terrain scenes. This is reflected in the global self-consistency distribution 
in (b), where one can see that about 10% of matches have normalized distances 
greater than 6. The reason for this is that this type of scene has significant 
amounts of repeating structure along epipolar lines. Consequently, a score based 
only on the quality of fit between two windows (such as the MDL-based score) 
will fail on occasion. A better score would include a measure of the uniqueness 
of a match along the epipolar line as a second component. We are currently 
exploring this. 

Second, note that the number of outliers in both the scatter diagram and the
self-consistency distributions decreases as window size increases. Thus, large
window sizes (in this case) produce more self-consistent results. But they also
produce fewer points. This is probably because this stereo algorithm uses
left-right/right-left equality as a form of self-consistency filter.

We have also visually examined the matches as a function of window size. 
When we restrict ourselves to matches with scores below -1, we observe that 
matches become sparser as window size increases. Furthermore, it appears that 
the matches are more accurate with larger window sizes. This is quite different 
from the results of Faugeras et al . — [S|. There they found that, in general, mat- 
ches became denser but less accurate as window size increased. We believe that 
this is because an MDL score below -1 keeps only those matches for which the 
scene surface is approximately fronto-parallel within the extent of the window, 
which is a situation in which larger window sizes increase accuracy. This is 
borne out by our visual observations of the matches. On the other hand, this 
result is basically in line with the results of Szeliski and Zabih 0211 , who show 
that prediction error decreases with window size. Deeper analysis of these results 
will be done in future work. 



6.5 Detecting Change 

One application of the self-consistency distribution is detecting changes in a 
scene over time. Given two collections of images of a scene taken at two points 
in time, we can compare matches (from different times) that belong to the same 
surface element to see if the difference in triangulated coordinates exceeds some 
significance level. This gives a mechanism for distinguishing changes which are 
significant from changes which are due to modeling uncertainty. More details, 
and experimental results are found in ng. 
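
One way such a significance test could look is sketched below. It is an illustration only, not the procedure of the cited work: it assumes that, in the absence of change, the squared normalized distance between two triangulations of the same surface element follows a chi-square distribution with three degrees of freedom, and the function names and default significance level are hypothetical. Since measured self-consistency distributions can have many more outliers than a Gaussian model predicts, a threshold read from the empirical distribution may be more appropriate in practice.

```python
import math


def chi2_cdf_3dof(x: float) -> float:
    """CDF of the chi-square distribution with three degrees of freedom."""
    if x <= 0.0:
        return 0.0
    return math.erf(math.sqrt(x / 2.0)) - math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


def is_significant_change(squared_normalized_distance: float, significance: float = 0.01) -> bool:
    """Flag a change when the distance is unlikely under reconstruction noise alone."""
    p_value = 1.0 - chi2_cdf_3dof(squared_normalized_distance)
    return p_value < significance
```

With the default level, a squared normalized distance of 1 is attributed to noise, while a value of 25 is flagged as a change.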

7 Previous Work in Estimating Uncertainty 

Existing work on estimating uncertainty without ground truth falls into three 
categories: analytical, statistical, and empirical approaches. 

The analytical approaches are based on the idea of error propagation izg. 
When the output is obtained by optimizing a certain criterion (like a correlation 
measure), the shape of the optimization curve or surface Q provides
estimates of the covariance through the second-order derivatives. These
approaches make it possible to compare the uncertainty of different outputs given by
the same algorithm. However, it is problematic to use them to compare different
algorithms. 

Statistical approaches make it possible to compute the covariance given only 
one data sample and a black-box version of an algorithm, by repeated runs of 
the algorithm, and application of the law of large numbers 0. 

Both of the above approaches characterize the performance of a given output 
only in terms of its expected variation with respect to additive white noise. 
In [21], the accuracy was characterized as a function of image resolution. The
bootstrap methodology [2] goes further, since it makes it possible to characterize
the accuracy of a given output with respect to IID noise of unknown distribution. 
Even if such an approach could be applied to the multiple image correspondence 
problem, it would characterize the performance with respect to IID sensor noise. 
Although this is useful for some applications, for other applications it is necessary 
to estimate the expected accuracy and reliability of the algorithms as viewpoint, 
scene domain, or other imaging conditions are varied. This is the problem we 
seek to address with the self-consistency methodology. 

Our methodology falls into the realm of empirical approaches. See for a 
good overview of such approaches. 

Szeliski CHI has recently proposed prediction error to characterize the perfor- 
mance of stereo and motion algorithms. Prediction error is the difference between 
a third real image of a scene and a synthetic image produced from the disparities 
and known camera parameters of the three images. This approach is especially 
useful when the primary use of stereo is for view interpolation, since the me- 
tric they propose directly measures how well the algorithm has interpolated a 
view compared to a real image of that same view. In particular, their approach 
does not necessarily penalize a stereo algorithm for errors in constant-intensity 
regions, at least for certain viewpoints. Our approach, on the other hand, at- 
tempts to characterize self-consistency for all points. Furthermore, our approach 
attempts to remove the effects of camera configuration by computing the measure 
over many observations and scenes. 

Szeliski and Zabih have recently applied this approach to comparing ste- 
reo algorithms HHEII]. A comprehensive comparison of our two methodologies 
applied to the same algorithms and same datasets should yield interesting 
insights into these two approaches. 

An important item to note about our methodology is that the projection ma- 
trices for all of the images are provided and assumed to be correct (within their 
covariances). Thus, we assume that a match produced by the stereo algorithm 
always lies on the epipolar lines of the images. Consequently, a measure of how
far matches lie from the epipolar line is not relevant.



8 Conclusion and Perspectives 

We have presented a general formalization of a perceptual observation called 
self-consistency, and have proposed a methodology based on this formalization 
as a means of estimating the accuracy and reliability of point-correspondence
algorithms, comparing different stereo algorithms, comparing different
scoring functions, comparing window sizes, and detecting change over time. We
have presented a detailed prescription for applying this methodology to
multiple-image point-correspondence algorithms, without any need for ground truth or
camera calibration, and have demonstrated its utility in several experiments.

The self-consistency distribution is a very simple idea that has powerful con- 
sequences. It can be used to compare algorithms, compare scoring functions, 
evaluate the performance of an algorithm across different classes of scenes, tune 
algorithm parameters (such as window size), reliably detect changes in a scene, 
and so forth. All of this can be done for little manual cost beyond the precise esti- 
mation of the camera parameters and perhaps manual inspection of the output 
of the algorithm on a few images to identify systematic biases. 

Readers of this paper are invited to visit the self-consistency web site at
http://www.ai.sri.com/sct/ to download an executable version of the code
described in this paper, together with documentation and examples.

Finally, we believe that the general self-consistency formalism developed in 
Sec.O which examines the self-consistency of an algorithm across independent 
experimental trials of different viewpoints of a static scene, can be used to assess 
the accuracy and reliability of algorithms dealing with a range of computer vision 
problems. This could lead to algorithms that can learn to be self-consistent over 
a wide range of scenes without the need for external training data or “ground 
truth.” 

A The MDL Score 

Given $N$ images, let $M$ be the number of pixels in the correlation window and
let $g_i^j$ be the image gray level of the $i$-th pixel observed in image $j$. For image $j$,
the number of bits required to describe these gray levels as IID white noise can
be approximated by:

$$C_j = M(\log \sigma_j + c) \tag{2}$$

where $\sigma_j$ is the measured variance of the $(g_i^j)_{1 \le i \le M}$ and $c = (1/2)\log(2\pi e)$.

Alternatively, these gray levels can be expressed in terms of the mean gray
level $\bar{g}_i$ across images and the deviations $g_i^j - \bar{g}_i$ from this average in each
individual image. The cost of describing the means can be approximated by

$$\bar{C} = M(\log \bar{\sigma} + c) \tag{3}$$

where $\bar{\sigma}$ is the measured variance of the mean gray levels. Similarly the coding
length of describing deviations from the mean is given by

$$C_j^d = M(\log \sigma_j^d + c) \tag{4}$$

where $\sigma_j^d$ is the measured variance of those deviations in image $j$. Note that,
because we describe the mean across the images, we need only describe $N - 1$
of the $C_j^d$. The description of the $N$th one is implicit.

The MDL score is the difference between these two coding lengths, normalized
by the number of samples, that is

$$\mathrm{Loss} = \bar{C} + \sum_{1 \le j \le N-1} C_j^d - \sum_{1 \le j \le N} C_j . \tag{5}$$

When there is a good match between images, the $(g_i^j)_{1 \le j \le N}$ have a small variance.
Consequently the $C_j^d$ should be small, $\bar{C}$ should be approximately equal to any of
the $C_j$ and Loss should be negative. However, Loss can only be strongly negative
if these costs are large enough, that is, if there is enough texture for a reliable
match. See m for more details. 
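
The coding costs (2) to (5) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of the natural logarithm, the population variance, and the small constant `EPS` that guards against log(0) are all assumptions made for illustration. The loss is returned without any further normalization, exactly as written in (5).

```python
import math

C_CONST = 0.5 * math.log(2.0 * math.pi * math.e)  # c = (1/2) log(2*pi*e)
EPS = 1e-12  # hypothetical guard against log(0); not part of the source


def variance(values):
    """Population variance of a list of gray levels."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def coding_cost(values):
    """M * (log(variance) + c) for one window of M gray levels."""
    return len(values) * (math.log(variance(values) + EPS) + C_CONST)


def mdl_loss(windows):
    """windows: list of N lists, each with the M gray levels seen in one image."""
    n = len(windows)
    m = len(windows[0])
    # Mean gray level of each pixel across the N images.
    mean_window = [sum(w[i] for w in windows) / n for i in range(m)]
    c_bar = coding_cost(mean_window)                        # eq. (3)
    c_dev = [coding_cost([w[i] - mean_window[i] for i in range(m)])
             for w in windows[: n - 1]]                     # eq. (4), N-1 terms
    c_ind = [coding_cost(w) for w in windows]               # eq. (2)
    return c_bar + sum(c_dev) - sum(c_ind)                  # eq. (5)
```

With textured windows that agree up to small noise, the deviation costs become strongly negative and dominate the sum, which is the behavior described in the text.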





Fig. 3. Averaged theoretical (solid) and experimental (dashed) curves. Panels:
perturbed projections; random affine projections.




Fig. 4. Comparing two stereo algorithms (Mesh vs Stereo) using the self-consistency 
distributions. 




Fig. 5. Scatter diagrams for three different scores. Panels: MDL; SSD/Grad; SSD.

Fig. 6. Three of six urban scenes used for the window comparisons. Each scene
contained 4 images.




Fig. 7. Comparing three window sizes (panels: 7 x 7; 15 x 15; 29 x 29). (a) The
combined self-consistency distributions of six urban scenes for window sizes
7 x 7, 15 x 15, and 29 x 29. (b) The scatter diagrams for the MDL score for these
urban scenes.

Abstract. Stereoscopic vision has a fundamental role both for animals 
and humans. Nonetheless, in the computer vision literature there is limi- 
ted reference to biological models related to stereoscopic vision and, in 
particular, to the functional properties and the organization of binocular 
information within the visual cortex. 

In this paper a simple stereo technique, based on a space variant mapping 
of the image data and a multi-layered cortical stereoscopic representa- 
tion, mimicking the neural organization of the early stages of the human 
visual system, is proposed. Radial disparity computed from a stereo pair 
is used to map the relative depth with respect to the fixation point. A set 
of experiments demonstrating the applicability of the devised techniques 
is also presented. 



1 Introduction 

Stereopsis is a fundamental visual cue for depth computation and for the
organization of spatial perception. Stereo vision has been proposed for the recovery
of the 3D structure of a scene, but stereo techniques have also been used for
object and face recognition.

The realization of anthropomorphic robots has long been inspired
by physiological and neuro-physiological studies of the human sensory-motor
system. In particular the studies on the human visual system show how the
structure of the retina implies a space-variant representation of the visual stimuli.
This is due to the size and distribution of the receptive fields in the retina:
the resolution is very high in the center of the visual field (fovea) and decreases
from the fovea toward the periphery. In the literature, the log-polar mapping
has been proposed to model the non-uniform distribution of the receptive fields
in the retina. By using this transformation the image information is
represented at different spatial frequencies, yet maintaining a fine detail within
the neighborhood of the fixation point.

Despite the great importance of stereopsis in many visual tasks and the
good properties of the log-polar mapping, very few researchers have exploited the
advantages of coupling stereo vision and log-polar sampling.

This paper presents a stereo matching technique specially devised to benefit
from the properties of the non-uniform, log-polar image representations. The
organization of the paper is as follows. Section 2 describes the mathematical
formulation of the log-polar sampling while the cortical model is described in
section 3. The stereo matching technique is described in section 4. Finally, sections
5 and 6 describe some experimental results and outline future developments.



2 Log-Polar Sampling 

Many studies of the human visual system show that the retina performs
a spatial sampling that can be modeled by a discrete distribution of elements
(receptive fields) whose size increases linearly with the eccentricity. The
transformation of the retinal image onto its cortical projection can be described by
a mapping of the cartesian coordinates on the retinal plane $(x, y)$ into polar
coordinates $(\rho, \theta)$ and then into the cortical plane $(\log(\rho), \theta)$. The diagram in figure
1 sketches this mapping for one receptive field covering several pixels on the
cartesian plane.

Fig. 1. Retino-cortical transformation

The formalization of the retino-cortical transformation (figure 2) used in this
paper likely differs from previously proposed models.
The advantage of the proposed transformation is the possibility to control in a
simple way the overlap among neighboring receptive fields. The parameters
required for log-polar sampling are the number of eccentricities ($N_r$), the number
of receptive fields per eccentricity ($N_a$) and the radial and angular overlap of
neighboring receptive fields ($O_r$ and $O_a$). The overlap which occurs along each
circle is controlled by the parameter $K_0$:



$$K_0 = \frac{S_i}{\rho_i}$$

where $S_i$ is the radius of the considered receptive field and $\rho_i$ is the eccentricity.
For adjacent fields the $K_0$ value can be obtained considering that

$$2\pi\rho_i = 2 S_i N_a .$$

Hence, we can observe that there is an overlap along each circle only when

$$K_0 > \frac{\pi}{N_a} .$$

Fig. 2. Retinal model

Concerning the overlap between receptive fields at different eccentricities a
second parameter $K_1$ is defined

$$K_1 = \frac{S_i}{S_{i-1}} = \frac{\rho_i}{\rho_{i-1}} .$$

The relationship between $K_0$ and $K_1$ for adjacent receptive fields can be
expressed as:

$$K_1 = \frac{\rho_i}{\rho_{i-1}} = \frac{1 + K_0}{1 - K_0} .$$

There is an overlap along each radial coordinate when

$$K_1 < \frac{1 + K_0}{1 - K_0} .$$

In practice, given the parameters $N_r$, $N_a$, $O_r$, $O_a$, the $K$ values are computed
as follows:

$$K_1 = \frac{O_r + K_0}{O_r - K_0} .$$


Concerning the smallest eccentricity $\rho_0$ used for the mapping, it obviously
depends on the size of the receptive fields in the fovea $S_0$:

$$\rho_0 = \frac{S_0}{K_0} .$$

In practice it is reasonable to assume $\rho_0 \in [0.5, 5]$, in such a way to preserve
as much as possible the original resolution of the cartesian image.
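
The relations among the sampling parameters can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, $K_0$ is taken as a given input, the eccentricities are assumed to grow geometrically with ratio $K_1$, and each receptive field radius is taken as $S_i = K_0 \rho_i$.

```python
import math


def log_polar_layout(n_r, k0, o_r, rho0):
    """Eccentricities and receptive field radii for n_r rings.

    k0   : ratio S_i / rho_i (assumed given)
    o_r  : radial overlap parameter (o_r = 1 gives adjacent, touching rings)
    rho0 : smallest eccentricity
    """
    if not 0.0 < k0 < o_r:
        raise ValueError("expected 0 < k0 < o_r")
    k1 = (o_r + k0) / (o_r - k0)               # ratio rho_i / rho_(i-1)
    rho = [rho0 * k1 ** i for i in range(n_r)]  # geometric growth of eccentricity
    radii = [k0 * r for r in rho]               # S_i = K_0 * rho_i
    return k1, rho, radii


def to_cortical(x, y):
    """Map a retinal point (x, y) != (0, 0) to the cortical plane (log(rho), theta)."""
    rho = math.hypot(x, y)
    theta = math.atan2(y, x)
    return math.log(rho), theta
```

With `o_r = 1` the outer edge of one ring touches the inner edge of the next; larger values of `o_r` reduce $K_1$ and make consecutive rings overlap.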



3 Multilayer Cortical Model 

The analysis of the primary visual cortex shows a layered organization, where 
each layer differs in response complexity; for example receptive fields can have 
very different sizes and respond strictly to a properly oriented visual stimulus. 
Crossing the cortex in a direction parallel to the layer surfaces, the organization 
of the information shows alternation from left eye to right eye and back, with 
thin stripes 0.5 millimeter wide (ocular dominance columns). 

In order to simulate this structure and the computational processing
underlying depth perception we introduce a cortical model obtained by fusing two
input cortical images (see figure 3).



Fig. 3. Cortical model. Panels: left cortical image; right cortical image; cortical
model.

In this model the left and right cortical images are combined along the radial 
direction. The first layer simply contains the grey level information while the 
second layer is the result of the application of a Laplacian of Gaussian and a 
zero crossing filter over the receptive fields of the retinal model. 

The Laplacian of Gaussian mask is approximated with a Difference of
Gaussians (DoG) that best matches the circular symmetry of the corresponding
receptive field:

$$F(\rho, \sigma_i, \sigma_e) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{\rho^2}{2\sigma_i^2}\right) - \frac{1}{\sqrt{2\pi}\,\sigma_e} \exp\left(-\frac{\rho^2}{2\sigma_e^2}\right)$$

where $\sigma_i$ and $\sigma_e$ are proportional, respectively, to the diameters of the surround
and center regions and $\sigma_i/\sigma_e = 2$. In order to implement a discrete filter we
calculate a priori a mask for any possible size of receptive field. For zero crossing
extraction a simple 3x3 mask is applied to the log-polar representation. The
slope of the zero crossing is then coded as a gray level in the range 0-254.
Figures 4 to 6 show the application of the multilayer cortical model to a stereo image
pair taken from a real scene.
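
A precomputed mask of this kind could be sketched in Python as follows. This is an illustrative assumption, not the authors' code: it assumes the surround-minus-center form of the Difference of Gaussians with a $1/(\sqrt{2\pi}\sigma)$ normalization, the ratio $\sigma_i/\sigma_e = 2$ stated in the text, and hypothetical function names and mask layout.

```python
import math


def dog(rho, sigma_i, sigma_e):
    """Difference of Gaussians at radial distance rho (surround minus center)."""
    norm = math.sqrt(2.0 * math.pi)
    surround = math.exp(-rho ** 2 / (2.0 * sigma_i ** 2)) / (norm * sigma_i)
    center = math.exp(-rho ** 2 / (2.0 * sigma_e ** 2)) / (norm * sigma_e)
    return surround - center


def dog_mask(size, sigma_e):
    """size x size mask sampled on a pixel grid centered on the receptive field."""
    half = (size - 1) / 2.0
    sigma_i = 2.0 * sigma_e  # ratio sigma_i / sigma_e = 2, as in the text
    return [[dog(math.hypot(x - half, y - half), sigma_i, sigma_e)
             for x in range(size)]
            for y in range(size)]
```

One mask per receptive field size can then be stored in a table indexed by eccentricity, which matches the idea of computing the masks a priori.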




Fig. 4. Cartesian stereo images 



4 Cortical Stereo 

The image projection of objects onto a stereo camera pair is subject to changes
in position, form and, according to light, even color. The problem of establishing
the correspondence between points in stereo images has been deeply investigated
in the past and a number of techniques have been proposed for cartesian images.
Considering space-variant images some additional issues have to be taken into
account:

1. the epipolar constraint, which is usually exploited for stereo matching, on the 
cortical plane becomes a non-linear function which projects curves instead 
of straight lines on the image plane; 

Fig. 5. (top) Log-polar transformation of left and right grey level images; (bottom) 
The resulting layer 1 of the cortical model where left and right columns are combined. 




Fig. 6. (top) Log-polar transformation of left and right contour images; (bottom) The 
resulting layer 2 of the cortical model where left and right columns are combined. 

2. in a few special cases the projections on the two images of the same point in
space lie on opposite sides (hence very far from each other) of the
log-polar plane. This may happen whenever the point in space is located
within the inter-ocular cone of the visual field between the cameras and the
fixation point, as shown in figure 7. On the other hand, in a real system the
fixation mechanism will control the vergence of the stereo cameras in order
to fixate the closest point within the inter-ocular visual cone.




Fig. 7. A point belonging to the inter-ocular visual cone produces projections on the 
log-polar plane which can be very far one from the other. As a consequence this point 
cannot be matched. 



The stereo matching process relies on the multi-layer information of the
cortical representation. First of all candidate points are selected by looking at the zero
crossing image (layer 2). Then, candidates in adjacent columns are compared by
using the gray level information (layer 1). Figure 8 depicts this process in the
cortical plane: two candidate contour points $A(\theta_A, \log(\rho_A))$ and $B(\theta_B, \log(\rho_B))$
are compared by computing the following expression:

$$C(\theta_A, \log(\rho_A), \theta_B, \log(\rho_B)) = \sum_i \left[ I_c(\theta_A, \log(\rho_A) + i) - I_c(\theta_B, \log(\rho_B) + i) \right] \cdot w_i$$

where $I_c$ is the cortical layer 1 and $w_i \in [0, 1]$ is a normalizing weight factor.
The index $i$ usually covers a small range around the candidate point, a few pixels
at most. If the computed value of $C()$ exceeds a given threshold the two
candidates are matched and they are removed from the candidate list. The disparity
computed along the radial direction is expressed by:

$$d_r = \log(\rho_A) - \log(\rho_B)$$

Fig. 8. Two candidate corresponding points, derived from layer 2, are compared by 
using gray level information in a small neighborhood (1x7 pixels wide). 



This value can be interpreted with reference to figure 9, where the sign of the
disparity is related to the position of the object with respect to the fixation
point. For instance, an object located behind the fixation point and on the right
side has negative radial disparity.
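
The comparison of two candidates and the radial disparity can be sketched in Python. This is a literal transcription of the expression for $C$ under several assumptions that are not in the source: layer 1 is stored as a list of rows indexed as `cortical[theta_index][log_rho_index]`, the offset $i$ runs over a window centered on the candidate, all indices stay inside the image, and the function names are hypothetical.

```python
def match_score(cortical, a, b, weights):
    """Weighted sum of gray level differences around two candidates.

    cortical : layer 1, indexed as cortical[theta_index][log_rho_index]
    a, b     : candidates as (theta_index, log_rho_index) pairs
    weights  : list of w_i in [0, 1], centered on the candidate position
    """
    half = len(weights) // 2
    (theta_a, r_a), (theta_b, r_b) = a, b
    total = 0.0
    for k, w in enumerate(weights):
        i = k - half  # radial offset around the candidate point
        total += (cortical[theta_a][r_a + i] - cortical[theta_b][r_b + i]) * w
    return total


def radial_disparity(log_rho_a, log_rho_b):
    """d_r = log(rho_A) - log(rho_B), in cortical (log-polar) units."""
    return log_rho_a - log_rho_b
```

The resulting score would then be compared against a threshold chosen for the application, and the sign of `radial_disparity` read against the disparity areas around the fixation point.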

5 Experimental Results 

In order to assess the validity of the proposed approach, several experiments have 
been performed both on synthetic and real images. Two off-the-shelf black/white 
cameras have been used, with a stereo baseline of about 200 mm. 

For the first experiment two synthetic stereo images were generated with the
fixation point set at 1500 mm from the virtual cameras. The scene contains a
vase, on the left side, lying between the cameras and the fixation point in space
and a cube, on the right side, located behind the fixation point. A diagram of
the scene and the two cartesian stereo images are shown in figure 10.

The log-polar mapping was applied to the original stereo images generating
the two layers of the cortical model (figure 11). The edge and gray-level
space-variant representations of the two images are fused together to obtain the final
cortical representation (figure 12).

In figure 13 the result of the stereo matching, based on the multi-layer
cortical representation, is shown. In the presented maps the gray level codes the
disparity between -127 (black) and 128 (white), with 0 represented as light gray.
As can be noticed, most of the pixels of the two objects have a negative
disparity. This result can be readily and correctly interpreted with the table of figure
9. As can be noticed, there are a few errors on the upper right corner of the
square and small portions of the vase. By comparing the computed disparity
represented on the cortical plane and the ocular dominance representation in
figure 12 it appears that most of the errors are due to ambiguities in the matching
between corresponding contours. For example, considering the wrong disparity
in the upper right edge of the square, as the edges of the two images are almost
coincident there is an ambiguity along the direction of the edge.

Fig. 9. Disparity areas with respect to the fixation point (top view)

The same experiment described for the synthetic scene has been performed
on images acquired from a real scene. For this experiment two stereo images
were acquired with the fixation point set at 1500 mm from the cameras. The
scene contains two puppets, one on the left side, lying between the cameras and
the fixation point in space and another, on the right side, located behind the
fixation point. A diagram of the scene and the two stereo images are shown in
figure 14. In figure 15 the cortical representation is shown.

In figure 16 the result of the stereo matching is shown. In this experiment
too, a few errors are due to ambiguities in the matching between corresponding
contours.

In the last experiment two stereo images were acquired inside the Computer
Vision laboratory with the fixation point set at 1500 mm from the cameras. The
scene contains several clearly distinguished objects located at different positions
with respect to the fixation point which is 2500 mm from the stereo cameras.
A diagram of the scene and the two stereo images are shown in figure 17. Figures
18 and 19 show the cortical model and the final result of the stereo matching
algorithm. Even though the scene is far more complex than those presented in
the previous experiments, still the most prominent regions in the images (the
man seated on the right and the pole on the left) clearly show disparity values
which are consistent with their position in space.



6 Conclusion 

In this paper a stereo matching technique, based on space-variant information, is 
proposed. Radial disparity is used to map the relative distance from the fixation 
point. This information, so far very rough, turns out to be quite robust even 
in complex environments. It is therefore an interesting cue for an active vision 
system, provided with fixation capabilities. Future work will be devoted to the 
refinement of the proposed technique, by using multiple layers (computed at 
different spatial resolutions) and by assessing the resolution in depth attainable 
in a neighborhood of the fixation point. 

Fig. 10. (left) Top view of the setup adopted for the synthetic scene; (right) Synthetic
stereo-images used in the first experiment.

Fig. 11. (top) Space-variant stereo images and (bottom) computed zero crossings after
the application of the difference of Gaussians receptive field to the original stereo
images.






Fig. 15. Cortical representation for layer 1 (top) and layer 2 (bottom). 

Fig. 16. Computed disparity represented on the (left) cortical and (right) cartesian
plane.




Fig. 17. (left) Setup for the stereo image acquisition of the “laboratory” scene; (right) 
Stereo-images acquired from the “laboratory” scene. 

Abstract. In stereoscopic images, the behavior of a curve in space is
related to the appearance of the curve in the left and right image planes.
Formally, this relationship is governed by the projective geometry
induced by the stereo camera configuration and by the differential structure
of the curve in the scene. We propose that the correspondence problem
(matching corresponding points in the image planes) can be solved by
relating the differential structure in the left and right image planes to
the geometry of curves in space. Specifically, the compatibility between 
two pairs of corresponding points and tangents at those points is related 
to the local approximation of a space curve using an osculating helix. To 
guarantee robustness against small changes in the camera parameters, 
we select a specific osculating helix. A relaxation labeling network de- 
monstrates that the compatibilities can be used to infer the appropriate 
correspondences in a scene. Examples on which standard approaches fail 
are demonstrated. 



1 Introduction 



The objects in our visual environment weave through space in an endless variety
of depths, orientations and positions (Figs. 1, 2). Nevertheless, in computer
vision the dominant theme has been to develop region-based methods to solve
the correspondence problem or else to focus on long, straight
lines. Edge features have been introduced to reduce the complexity of
matching, but ambiguities along the epipolar lines (Fig. 3) are dealt with by a
distance measure, e.g. similarity of orientation, over the possible edge matches.
While this filters the number of potential matches, it does not solve the problem
of which edge elements in each figure should correspond.

To resolve these remaining ambiguities, global (heuristic) constraints have
been introduced, including the uniqueness constraint, the ordering constraint
(the projections of two objects in an image must have the same left to
right ordering as the objects in the scene), the smoothness constraint,
and so on. However, each of these is heuristic, and breaks down for
the above images; see Fig. 3. It is our goal to develop a stereo correspondence
system that can work for scenes such as these.

Fig. 1. A stereo pair of twigs. There are many places in the stereo pair where the
ordering constraint breaks down. The images were taken using a verged stereo rig and
a baseline of 9cm.

The occluding contour of an object weaves through space as does the object 
with which it is associated. The projection of the occluding contour is an edge 
curve that winds across the image plane. We focus on the relationship between 
the differential-geometric properties of curves in a scene and the projection of 
those curves into the left and right image planes. These observations allow us to 
explore the reasons why conventional stereo systems may fail for natural images 
and will provide the basis for our new approach. 

The distance measures that have been defined on features in the image plane,
together with global constraints, have not fully captured the relationships
between a space curve in the scene and the behavior of the projected curves in
the (left, right) image planes. For example, consider matching edge elements by
orientation along an epipolar line: if the threshold for determining a match were
too high (typical values include ±30 deg), then too many possible matches
would be admitted; if the threshold were too low, then the system would be
restricted to detecting edges that do not traverse through depth. Imposing
additional measures, such as the curvature of lines in the image plane, does not solve
the problem of ambiguity because, like the measures on orientation, measures
on curvature do not take into account the behavior of curves in three space. The
structure of the curves in the image plane must derive from the geometry of the
space curve.




Fig. 2. A synthetic stereo pair consisting of a y-helix and a circular arc. 

Fig. 3. The false matches obtained using the PMF algorithm of Pollard, Mayhew, and
Frisby in the TINA system on the example in Fig. 2. (top) Three orthogonal
projections of the reconstruction. The rightmost "bird's-eye view" illustrates the complete
merging of two curves in depth. The incorrect reconstruction is due to the failure of
the ordering constraint in the image, which is highlighted in (bottom) zoom. The
(filled) darkened pixels in the left and right images correspond in TINA according to
the ordering constraint, but are clearly incorrect.



Specifically, we will relate the local approximation of a curve in three space 
to its projection in the left and right image planes. The behavior of a planar 
curve can be described in terms of its tangent and curvature. Similarly, a curve 
in three space can be described by the relationships between its tangent, normal 
and binormal. As the curve moves across depth planes, there exists a positional 
disparity between the projection of the curve in the left image and the projection 
in the right image. However, there also exist higher order disparities, for example 
disparities in orientation, that occur. It is these types of relationships that can 
be capitalized upon when solving the correspondence problem. One approach
is to relate them directly to surface slant and tilt. Rather than defining
a distance measure that compares the positions or orientations of a left/right 
edge-element pair, we will require that neighboring pairs be locally consistent. 
That is to say: (i) we shall interpret edge elements as signaling tangents; and 
(ii) we shall require that there exists a curve in three space whose projection in 
the left and right image planes is commensurate with the locus of tangent pairs 
in a neighborhood of the proposed match. 

The reduction of the correspondence problem to relations between nearby 
tangents distributed across (left, right) image pairs opens up a second goal for 
this research - biologically plausible stereo correspondence algorithms that can 
run in orientation hypercolumns. However, space limitations preclude us from 
further developing this goal. 

Clearly, the role of projective geometry must not be ignored. One would 
not want correspondence relations that depended in minute detail on camera 
parameters and precise camera calibration, otherwise the solution would be too 
delicate to use in practice. We seek those relationships between the differential 
structure of a space curve and the differential structure of image curves on which 
stereo correspondence is based to be invariant to small changes in the camera 
parameters. We refer to this as the Principle of differential invariance. 

2 Differential Formulation of the Compatibility Field 

Let $\alpha(s)$ be a curve in $\mathbb{R}^3$. Assuming measurements of the curvature, torsion
and Frenet frame of the curve at $\alpha(0)$, we obtain a local approximation of the
curve by taking the third order Taylor expansion of $\alpha$ at $s = 0$. Using the Frenet
equations and keeping only the dominant term in each component, we obtain
the Frenet approximation of $\alpha$ at $s = 0$ (see Fig. 4):

$$\hat{\alpha}(s) = \alpha(0) + s T_0 + \kappa_0 \frac{s^2}{2} N_0 + \kappa_0 \tau_0 \frac{s^3}{6} B_0 \tag{1}$$




Fig. 4. The Frenet approximation to a curve in $\mathbb{R}^3$ at a point $\alpha(0)$. Without loss of
generality, the point $\alpha(0)$ is assumed to be coincident with the origin.
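
The Frenet approximation of equation (1) can be evaluated with a few lines of Python. This is a minimal sketch: the function names and the representation of points and frame vectors as 3-tuples are assumptions made for illustration.

```python
import math


def frenet_approximation(s, origin, T, N, B, kappa, tau):
    """Evaluate eq. (1): alpha(0) + s*T + kappa*s^2/2*N + kappa*tau*s^3/6*B."""
    return tuple(
        o + s * t + kappa * s ** 2 / 2.0 * n + kappa * tau * s ** 3 / 6.0 * b
        for o, t, n, b in zip(origin, T, N, B)
    )


def distance(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
```

For a unit circle in the plane (curvature 1, torsion 0), the approximation at a small arc length such as s = 0.1 stays close to the true point (cos s, sin s, 0), which gives a quick sanity check of the frame and sign conventions.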



Another approximation can be derived for the case of a planar curve. For the
moment let $\alpha$ be a unit speed planar curve with $\kappa > 0$. There is one and only one
OSCULATING CIRCLE $\psi$ which approximates $\alpha$ near $\alpha(s)$ up to second-order in
the Frenet sense. This approximation was used in defining the CO-CIRCULARITY
CONSTRAINT for determining consistency between nearby edge elements.
Our technical goal in this section is to generalize co-circularity to stereo. The
following abstraction is useful:

Transport Problem Given a unit speed curve $\alpha$, determine a unit vector field,
$u$, on $\alpha$ such that $\alpha'(s) \cdot u(s) = \text{const}$.

For any unit speed curve, $\alpha$, the vector field $u = T$ can be defined such
that the pointwise relation $\alpha' \cdot u = 1$ holds. If a curve is reconstructed from
an image, then the reconstructed curve must satisfy this property. However, this
assumes (unrealistically!) that $\alpha$ was known around the point $\alpha(0)$. The basis for
co-circularity is the observation that $\alpha$ can be locally approximated at 0 via the
osculating circle, $\psi$: if the tangent at the position $\psi(\delta s)$ is transported along the
osculating circle from $\psi(0)$ to $\psi(\delta s)$, it should approximate the tangent at $\alpha(\delta s)$
to first order. This TRANSPORT constraint was used to define compatibilities
for a relaxation labeling process.

For the stereo correspondence problem, two edge maps are given (one for the
left camera and one for the right); each of these will be consistent (in the sense
that they satisfy the monocular co-circularity transport constraint); our goal now
is to make them consistent with a local approximation to the space curve from
which they project. The generalization of the osculating circle for space curves
requires an osculating helix. To begin, recall that a unit speed circular helix is a
space curve $h(s)$ where $h(s) = (a\cos(s/c),\ a\sin(s/c),\ bs/c)$, where $c = \sqrt{a^2 + b^2}$.

Osculating helix Let $\alpha$ be a unit speed space curve with $\kappa > 0$. There is one
and only one unit speed circular helix, $h$, that locally approximates $\alpha$ at $s$ in the
sense of Frenet: $h(0) = \alpha(s)$; $T_h(0) = T_\alpha(s)$; $N_h(0) = N_\alpha(s)$; $\kappa_h = \kappa_\alpha(s)$;
$\tau_h = \tau_\alpha(s)$.
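
The helix matching a given curvature and torsion can be computed directly, as the following Python sketch illustrates. It relies on the standard facts that a circular helix with radius $a$ and pitch parameter $b$ has curvature $a/(a^2+b^2)$ and torsion $b/(a^2+b^2)$. The function names are hypothetical, and the helix is returned in its own canonical frame, so aligning it with the Frenet frame of the curve is left out.

```python
import math


def helix_from_curvature_torsion(kappa, tau):
    """Return (a, b) of the circular helix with curvature kappa > 0 and torsion tau."""
    d = kappa ** 2 + tau ** 2
    return kappa / d, tau / d


def helix_point(s, a, b):
    """Point h(s) on the unit speed helix (a cos(s/c), a sin(s/c), b s/c)."""
    c = math.hypot(a, b)
    return (a * math.cos(s / c), a * math.sin(s / c), b * s / c)
```

When the torsion is zero the helix degenerates to a circle of radius $1/\kappa$, which recovers the planar osculating circle.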

Now, consider a point $M \in \mathbb{R}^3$. There exists a family of unit speed smooth
space curves that pass through $M$. For each such curve, it is possible to construct
an osculating helix that locally approximates the curve at $M$. Each such
approximation is only valid in a small neighborhood around $M$ (denoted $\mathcal{N}(M)$). The
projection of $M$ onto the left image plane is $m_l = P_l(M)$. Similarly, the
projection of the neighborhood around $M$ is a neighborhood around $m_l$, $\mathcal{N}(m_l) =
P_l(\mathcal{N}(M))$, where $\mathcal{N}(m_l)$ is a connected subset of the image plane. The
neighborhood of the stereo pair $(m_l, m_r) = (P_l(M), P_r(M))$ is the set of corresponding
points in the space $\mathcal{N}(m_l) \times \mathcal{N}(m_r)$.

Let $M$ be a point in $E^3$ and $T_M$ the tangent vector at $M$. The mapping $S$
is the projection of $M$ and $T_M$ onto the left and right image planes:

$$S : E^3 \times T(E^3) \to E^2 \times E^2 \times T(E^2) \times T(E^2)$$

We refer to the space $E^2 \times E^2 \times T(E^2) \times T(E^2)$ as the stereo tangent space and
denote it with the symbol $\vartheta$. A point $i \in \vartheta$ can be represented as $(x_l, y_l, x_r, y_r, \theta_l, \theta_r)$
where $x$ and $y$ are the projection of $M$ in the image plane and $\theta$ is the
orientation of the projected tangent in $T(E^2)$. The mapping $S$ is invertible everywhere
except for the line between the optical centers and the focal planes.

Two image points, one from the left image plane and one from the right, are 
referred to as corresponding points if the inverse projection of the pair is a point 
on some object in the scene. 

We also define the projection map:

$$\pi : E^2 \times E^2 \times T(E^2) \times T(E^2) \to E^2 \times E^2$$
$$\pi(x_l, y_l, x_r, y_r, \theta_l, \theta_r) = (x_l, y_l, x_r, y_r)$$

The map $P^{-1}(\pi(i))$ maps the stereo tangent pair, $i$, to the corresponding
point in three space. The map $S^{-1}(i)$ maps the stereo tangent pair, $i$, to a point
in three space along with its associated tangent.
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Since a at M can be approximated by an osculating helix at M, the image of 
the curve a in a neighborhood around (mi,rrij.) = (Pi{M), Pr{M)) is commen- 
surate with the projection of the osculating helix into the left and right image 
planes. Furthermore, the direction of corresponding tangents in the left and right 
image planes is approximately equal to the direction of the projected tangents 
along the osculating helix. Each pair of corresponding points and tangents is 
referred to as a stereo tangent pair. If i is a stereo tangent pair, i € d, associated 
with a point on an arbitrary curve in three space, then in a neighborhood around 
i, the set of stereo tangent pairs will be commensurate with the positions and 
tangents of the projected osculating helix. We denote the neighborhood around 
i as Af{i), where A/’(z) = {j G d\Tr{j) G A/’(7t(i))}. Formally, 

Stereo Compatible: Let $i \in \vartheta$ and $j \in \mathcal{N}(i)$. $\{i, j\}$ are
stereo-compatible under $S$ if and only if $\exists\, h(s)$, a circular helix, such
that $S(h(s)) \supset \{i, j\}$. We denote the compatibility of $i$ and $j$ under
the helix $h$ as $(i \overset{h}{\sim} j)$.
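
The definitions above can be sketched in a few lines of code. This is only an illustration, not the authors' implementation: the class name `StereoTangentPair`, the function names, and the use of a Euclidean ball of a given `radius` as the neighborhood in the space of position pairs are all assumptions.

```python
import math
from typing import NamedTuple, Tuple


class StereoTangentPair(NamedTuple):
    """A point (x_l, y_l, x_r, y_r, theta_l, theta_r) of the stereo tangent space."""
    xl: float
    yl: float
    xr: float
    yr: float
    theta_l: float
    theta_r: float


def project(i: StereoTangentPair) -> Tuple[float, float, float, float]:
    """The projection map pi: keep the two image positions, drop the orientations."""
    return (i.xl, i.yl, i.xr, i.yr)


def in_neighborhood(i: StereoTangentPair, j: StereoTangentPair, radius: float) -> bool:
    """j is in N(i) iff pi(j) is in N(pi(i)).

    Assumption: N(pi(i)) is taken to be a Euclidean ball of the given radius.
    """
    return math.dist(project(i), project(j)) <= radius


i = StereoTangentPair(0.0, 0.0, 0.0, 0.0, math.pi / 2, math.pi / 2)
j = StereoTangentPair(0.5, 0.0, 0.5, 0.0, 0.1, 2.0)
print(in_neighborhood(i, j, radius=1.0))
```

Note that membership in the neighborhood depends only on the positions; the orientations of $j$ are unconstrained until a compatibility relation is imposed.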

Figure 5 depicts the projection of several helices onto the left and right
image planes, each of which is locally consistent (according to the co-circularity
transport constraint). Equivalently, for a pair of nearby points that lie on the
curve in the left image, for example $m_l$ and $n_l$, the position and tangent at
$n_l$ satisfy the transport constraints at $m_l$. The same property holds for
$m_r$ and $n_r$, two nearby points on the curve in the right image plane. In
addition, for each pair of neighboring stereo pairs, $m = (m_l, m_r)$ and
$n = (n_l, n_r)$, the inverse projections of the corresponding points lie on the
original helix and the inverse projections of the tangents at $m$ and $n$ have the
same orientation as the tangent in three space. This is true for every pair of
corresponding points that lie on the perspective projection of the helix. Since
the inverse projections $P^{-1}(m)$ and $P^{-1}(n)$ lie on a common helix, the
transport constraints at $m$ are necessarily satisfied when we transport along
the image of the helix from $n$ to $m$. Therefore, we can express the
compatibility between a pair of corresponding points as a compatibility
relationship between stereo tangent pairs. However, one cannot arbitrarily pick
a curve in the left image and a curve in the right image and expect there to be
a helix in $\mathbb{R}^3$ that projects to the given curves. Only those helices
that satisfy the compatibility relation for a given stereo tangent pair are
admissible pairings of curves in left and right images. It is the difference in
the projection of the helices that allows us to solve the correspondence problem.
In fact, we have the uniqueness lemma:

Let $i_1, i_2 \in \vartheta$ such that $P^{-1}(\pi(i_1)) = P^{-1}(\pi(i_2))$ and
$S^{-1}(i_1) \neq S^{-1}(i_2)$. There does not exist a circular helix $h$ and
$j \in \mathcal{N}(i_1) = \mathcal{N}(i_2)$ such that $(i_1 \overset{h}{\sim} j)$ and
$(i_2 \overset{h}{\sim} j)$.

Proof: Suppose there exists $h$ such that $(i_1 \overset{h}{\sim} j)$ and
$(i_2 \overset{h}{\sim} j)$. From the definition of compatibility, $S^{-1}(j)$ lies
on the helix, and the helix $h$ passes through
$M = P^{-1}(\pi(i_1)) = P^{-1}(\pi(i_2))$. We transport along the helix from
$P^{-1}(\pi(j))$ to $M$. We denote the tangent at $M$ as $T_M$. Since
$(i_1 \overset{h}{\sim} j)$, $i_1 \in S(h)$ and the projection of $T_M$ is the
tangent at $i_1$. But $(i_2 \overset{h}{\sim} j)$, so $i_2 \in S(h)$ and the
projection of $T_M$ is the tangent at $i_2$. But the tangent at $i_1$ differs from
the tangent at $i_2$, and therefore $T_M$ has two values, which is a contradiction.



[Figure panels: Helices; Projection in the left image; Projection in the right image]

Fig. 5. Four helices in three space whose projections pass through the point
$i = (0, 0, 0, 0, \pi/2, \pi/2)$. The projections in the left and right image
planes are shown.

The COMPATIBILITY FIELD for $i \in \vartheta$ under $S$ is the set of all pairs
$(j, h_j)$, where $j \in \mathcal{N}(i)$ and $h_j$ is a unit speed circular helix,
such that the relation $(i \overset{h_j}{\sim} j)$ holds for the camera model $S$.
We denote the compatibility field for a point $i$ as $CF_S(i)$.

The solution to the correspondence problem can now be framed in terms of
the local approximation of curves. Consider a point $M$ on a smooth curve in
three space and a pair of points, not necessarily corresponding points, in the
image plane, $m = \{m_l, m_r\}$ (the subscripts indicate the left and right
components of the pair, respectively). Earlier, it was shown that there exists an
osculating helix to the curve at $M$. Therefore, $m_l$ and $m_r$ are a match if and
only if the image of the osculating helix in the left and right image planes is
coincident with the locus of edge points around $m_l$ and $m_r$.

Formally, solving the correspondence problem is equivalent to solving the
transport problem. Consider two arbitrary curves, one in the left image plane
and one in the right. Pick two tangents, $(m_l, \theta_l)$ and $(m_r, \theta_r)$,
such that they lie on matching epipolar lines in the left and right image planes.
The two curves can be expressed as a single curve $\beta$ in $\vartheta$. Let
$\beta(0) = (m_l, m_r, \theta_l, \theta_r)$ and
$\beta(\delta s) = (\bar{m}_l, \bar{m}_r, \bar{\theta}_l, \bar{\theta}_r)$ be two
points along $\beta$. $(m_l, \theta_l)$ and $(m_r, \theta_r)$ are corresponding
tangents if there exists a helix in $\mathbb{R}^3$ that satisfies the transport
criteria at $\beta(0)$ and $\beta(\delta s)$. That is to say, there exists a unit
speed osculating helix, $h \in \mathcal{H}$, such that
$S(h(0), T_h(0)) = \beta(0)$ and $S(h(\delta t), T_h(\delta t)) \approx \beta(\delta s)$.
These transport criteria are expressed as the position and tangent direction of
the projected helix.

The compatibility field around a point $i = (m_l, m_r, \theta_l, \theta_r) \in \vartheta$
is the set of all stereo tangent pairs in a neighborhood of $i$ for which the
transport constraints can be satisfied. To draw this in a figure we separate the
left components of $i$, $(m_l, \theta_l)$, from the right components
$(m_r, \theta_r)$. The result is what appears to be a tangent field surrounding
the point for which the compatibility field was constructed. Each tangent in the
left compatibility field has a matching tangent in the right field. However,
there may be a disparity in both the position and orientation between the
matching tangents. It is the differences in position and orientation between
corresponding tangents in the left and right compatibility fields that are the
key to solving the correspondence problem. This relationship is captured
directly in the compatibility fields.

However, we are now at the place where the projective geometry (the camera 
model) interacts with the differential geometry: 

Observation: Given a stereo tangent pair $i$, the equations used to determine
the osculating helix to the curve $\alpha$ at $S^{-1}(i)$ are underdetermined. For
a stereo tangent pair, it is possible to invert the perspective projection
matrices to obtain the position, $M$, and tangent of the curve in $\mathbb{R}^3$.
However, given only a stereo tangent pair, it is not possible to determine the
direction of the normal, the curvature or the torsion. Therefore, there does not
exist enough information to uniquely determine the helix passing through $M$.
The family of possible helices passing through $M$ is homeomorphic to the space
$S^1 \times \mathbb{R}^+ \times \mathbb{R}$ (where $S^1$ denotes the unit circle).
In the next section, we will provide an argument based on the principle of
differential invariance to consider only a subset of the helices in
$S^1 \times \mathbb{R}^+ \times \mathbb{R}$.

2.1 The Invariant Compatibility Field 

The construction of a compatibility field requires a precise characterization of
the camera (intrinsic/extrinsic) parameters. It would be impractical to
recompute the compatibility fields for every set of camera parameters, and it
would be unrealistic to assume that the camera parameters could be obtained
with arbitrary precision. We seek an Invariant Compatibility Field: Consider
the mappings $S_1$ and $S_2$, where each mapping is constructed using a different
set of camera parameters. Let $i \in \vartheta$. For every $j \in CF_{S_1}(i)$,
$\exists\, h_j \in \mathcal{H}$ such that $S_1(h_j) \supset \{i, j\}$. The
$CF_{S_1}(i)$ is invariant to changes in the camera parameters if, for all $h$
that satisfy $S_1(h) \supset \{i, j\}$, $S_2^{-1} S_1(h) \in \mathcal{H}$.

To simplify the analysis, we assume that only the camera parameters $\theta$,
$d_x$ and $d_y$ are variable. Further, we assume that the fixation point of the
camera, $(x_0, y_0, z_0)$, is such that $z_0 > x_0$, $z_0 > y_0$ and $z_0 \gg d_x$.
The last restriction constrains the value of $\theta$ to be close to zero. Lastly,
we assume that the cameras are symmetrically verged at the fixation point.

If the compatibility field is computed for a small neighborhood around a point
$i \in \vartheta$, then the affine projection equation can be used to study the
invariance of the compatibility fields under changes in the camera parameters.
Specifically, we determine whether a helix that satisfies the compatibility
criteria between two points in $\vartheta$ is preserved under changes in the
camera parameters by studying the deformation of the spherical image of a
circular helix under $F = S_{2*}^{-1} \circ S_{1*}$, where $S_{i*}$ is the affine
derivative map of $S_i$. We refer to $F$ as the affine stereo transformation.

We can use the spherical image of a helix, $\sigma$, to define a circular cone in
$\mathbb{R}^3$, $x_\sigma(u, v) = u\,\sigma(v)$. Let $\alpha$ be a unit speed
cylindrical helix. For every parameterization $\beta$ of $\alpha$, the image of
$\beta'$ lies on a cone defined by the spherical image of $\alpha$. This can be
proved by defining $\beta$ as $\beta(t) = \alpha(r(t))$ and consequently
$\beta'(t) = r'(t)\,\alpha'(r(t))$. Fix $s \in (0, L)$, where $L$ is the arc-length
of $\alpha$. We can write $\beta'(t^*) = r'(t^*)\,\alpha'(s)$ where
$t^* = r^{-1}(s)$. $\alpha'(s)$ is a point on $S^2$ (unit sphere) represented by
the vector $v$, so $\beta'(t^*) = r'(t^*)\,v$. Thus for any reparameterization by a
function $r$, $\beta'(t^*)$ is a point along the ray defined by $v$, and $v$ lies
on the cone defined by the spherical image of $\alpha$.
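
The argument can be checked numerically. The sketch below is illustrative only: the helix radius and pitch, the reparameterization `r`, and all function names are assumptions. It differentiates a reparameterized helix by central differences and confirms that the tangent keeps a constant angle to the helix axis, while its length scales with $r'(t)$.

```python
import math

A, B = 1.0, 0.5            # hypothetical helix radius and pitch parameters
C = math.hypot(A, B)       # normalizer that makes the helix unit speed


def helix(s):
    """Unit speed circular helix alpha(s) about the z axis."""
    return (A * math.cos(s / C), A * math.sin(s / C), B * s / C)


def r(t):
    """Hypothetical reparameterization, increasing for t >= 0."""
    return t + 0.3 * t * t


def beta_prime(t, h=1e-5):
    """Central-difference derivative of beta(t) = alpha(r(t))."""
    p, m = helix(r(t + h)), helix(r(t - h))
    return tuple((pk - mk) / (2 * h) for pk, mk in zip(p, m))


def angle_to_axis(v):
    """Angle between v and the z axis (the cone axis)."""
    return math.acos(v[2] / math.hypot(*v))


cone_half_angle = math.acos(B / C)
angles = [angle_to_axis(beta_prime(t)) for t in (0.1, 0.7, 1.5, 2.3)]
print(cone_half_angle, angles)
```

All sampled angles agree with the half-angle of the cone, whatever the speed of the reparameterized curve.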

Thus an equivalence relation can be defined on the family of cylindrical heli- 
ces. Each equivalence class consists of all helices that are the reparameterizations 
of a given unit speed cylindrical helix. The spherical image of every helix in the 
equivalence class lies on the cone defined by the spherical image of the unit speed 
helix. We require the following results: 

1. The image of a circular cone under a linear mapping $F$ is a generalized cone.

2. By the Jordan curve theorem, the curve $\sigma$ has a well defined interior.
Consider the closed set, $C$, where $\sigma = \partial C$. Suppose the set $C$ is
convex. The image of this set under the linear mapping $F$ is also convex [3].
Therefore, if $\sigma(v)$ has curvature less than zero, the curve $F(\sigma(v))$
also has curvature less than zero. This implies that $x(u, v)$ is always bending
away from the normal (if the cone is oriented with outward normals) and that the
normal curvature is less than or equal to zero everywhere along the generalized
cone. Along the rulings of the cone, the principal curvature $\kappa_1 = 0$. The
other principal curvature, $\kappa_2$, is less than zero.

The above lemmas allow us to phrase the problem of invariance in terms of
deformations of the circular cone:

Theorem: The compatibility field is invariant if $\forall h \in \mathcal{H}$,
$F(x_\sigma) \in \mathcal{C}$, where $\mathcal{C}$ is the family of all circular
cones, $x_\sigma \in \mathcal{C}$ and $\sigma$ is the spherical image of the
circular helix.

A global property of a circular cone is that there exists a vector $a$ such that
the angle between $x(u, v)$ and $a$ is a constant,
$\angle(x, a) = k, \ \forall u > 0, v \in \mathbb{R}$. If the cone $F(x)$ is a
circular cone then:

$$\angle(F(x(u, v)), F(a)) = k$$

$$\Lambda(u, v) = \frac{\partial\, \angle(F(x(u, v)), F(a))}{\partial v} = 0 \qquad (2)$$

The function $\Lambda$ is a smooth continuous function because the surface of the
cone is smooth. Since $\Lambda$ is a function of the derivative map $S_*$, it is
also a function of the camera parameters, e.g. $\theta_l$ and $\theta_r$. If the
camera parameters defined by $S_1$ are fixed, and if initially $S_2 = S_1$, the
derivative of $\Lambda$ with respect to some camera parameter of $S_2$ measures
the invariance of the compatibility field to changes in that camera parameter.
If, for example, $\partial \Lambda / \partial \theta_l$ is non-zero, then the
helix is deformed by the mapping $S_2^{-1} S_1$ as $\theta_l$ varies. Therefore,
$\Lambda$ can be used to study which unit speed circular helices are invariant
to changes in the camera parameters.
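
A rough numerical counterpart of this test is sketched below. It is an illustration under stated assumptions rather than the paper's computation: the map `F` is an arbitrary 3x3 matrix standing in for the affine stereo transformation, the helix parameters are hypothetical, and the spread of the angle over one turn of the spherical image is used as a stand-in for a non-zero angular derivative.

```python
import math

A, B = 1.0, 0.5
C = math.hypot(A, B)


def spherical_image(v):
    """Unit tangent of a circular helix about the z axis, at phase v."""
    return (-A / C * math.sin(v), A / C * math.cos(v), B / C)


def apply(F, x):
    """Multiply the 3x3 matrix F (list of rows) by the vector x."""
    return tuple(sum(F[row][k] * x[k] for k in range(3)) for row in range(3))


def angle(p, q):
    dot = sum(pk * qk for pk, qk in zip(p, q))
    cos_a = dot / (math.hypot(*p) * math.hypot(*q))
    return math.acos(max(-1.0, min(1.0, cos_a)))


def angle_spread(F, axis=(0.0, 0.0, 1.0), samples=64):
    """Max minus min of the angle between F(sigma(v)) and F(axis).

    Zero (up to rounding) means the angle is constant along the image curve,
    as required of a circular cone about F(axis).
    """
    Fa = apply(F, axis)
    angles = [angle(apply(F, spherical_image(2 * math.pi * k / samples)), Fa)
              for k in range(samples)]
    return max(angles) - min(angles)


t = 0.2
rotation = [[math.cos(t), 0.0, math.sin(t)],
            [0.0, 1.0, 0.0],
            [-math.sin(t), 0.0, math.cos(t)]]
shear = [[1.0, 0.0, 0.3],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0]]
print(angle_spread(rotation), angle_spread(shear))
```

A rotation leaves the angle constant, whereas the shear makes it vary along the curve, so the sheared cone fails the circularity test about the mapped axis.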

Consider the set of all unit speed circular helices, $\mathcal{H}_0$, that lie on
a circular cylinder whose axis goes through the origin. Let the direction of the
axis represent the direction of the helix. The direction of every helix in
$\mathcal{H}_0$ can be represented by a point on $S^2$. We can change the
direction of each helix with a rotation matrix $R$, which has two degrees of
freedom. Consider the point $(0, 0, 1)$ on $S^2$. Any other point on $S^2$ can be
defined by a rotation around the x-axis followed by a rotation around the y-axis.
We define the mapping $R : S^2 \to S^2$ as such a function. The mapping $R$ is a
smooth continuous function [?]. The mapping $R$ changes the direction of a helix
and rotates the spherical image of the helix accordingly. If $\sigma_1$ is the
spherical image of a unit speed helix, $h_1 \in \mathcal{H}_0$, and we define
$h_2 = R(h_1)$, then by the linearity of differentiation the spherical image of
$h_2$ is $\sigma_2 = R(\sigma_1)$.

Equation 2 can be used to study the invariance of the helix $h_1$ by considering
its spherical image $\sigma_1$. Similarly, it is possible to study the invariance
of $h_2$ using the spherical image $\sigma_2 = R(\sigma_1)$. Therefore, it is
possible to study the invariance of an entire family of helices, each with a
different direction, by considering different rotation matrices. There is a
natural symmetry on $S^2$ (unit sphere) based on the fact that the deformation
associated with a point on $S^2$ and its conjugate are equal. Figure 6 shows an
example of the surface that is generated by considering the deformation for a
family of helices, a subset of $\mathcal{H}_0$, that have the same value of
curvature and torsion but whose directions vary. The locus of points along the
bottom of the valley represents the helix whose axis is defined by the vector
$(0, 0, 1)$. We will denote this helix as the z-helix. The figure demonstrates
that the z-helix is most stable to small changes in the camera parameter
$\theta_l$. A similar observation is made when other camera parameters are
varied.




Fig. 6. The degree of deformation, normalized between 0 and 1, for helices pointing
in different directions. The direction of each helix is represented by a rotation around
the x and y axes of the standard Euclidean frame. The variables x and y represent a
rotation about the x and y axis respectively. The helix with 0° of rotation around
the x-axis is referred to as the z-helix. The value of curvature and torsion for each helix
in the family is and See text for details. 



Figure 7 (top) shows three helices which are oriented along three different and
mutually perpendicular axes. The degree of invariance to changes in $\theta_l$
for each of the helix directions is shown at the bottom of the figure. Of the
three helices, the z-helix is the most stable to changes in the camera parameters
for most ratios of $\tau/\kappa$. A similar observation is made for small changes
in $\theta_r$, $d_x$ and $d_y$.

It was stated earlier that any solution to the stereo correspondence problem
must satisfy the principle of differential invariance. In Figure 7 the magnitude
of the maximum deformation of the x- and y-helix is significantly larger
compared to the z-helix. Therefore, of the three helices, the z-helix comes
closest to satisfying the principle of differential invariance.



Fig. 7. Top: Shown are three helices in $\mathbb{R}^3$, denoted x-helix, y-helix,
z-helix from left to right respectively. Bottom: The degree of deformation of each
helix type as the left camera angle is slightly perturbed. The maximum amount of
deformation of each helix type is shown as a function of the ratio $\tau/\kappa$.
The line of sight is along the z-axis. The z-helix, the rightmost helix, is
significantly more stable under camera perturbations than the other two.

The orientation disparity encoded by a stereo tangent pair can now be in- 
terpreted in terms of the physical behavior of a z-helix in three space. If the 
amount of torsion is equal to zero, the helix is confined to the plane of fixation 
and the orientation disparity is necessarily zero [?]. As the torsion increases, the
z-helix begins to twist out of the plane of fixation. The tangent at the point 
where the helix crosses the plane of fixation is no longer parallel to the plane. 
This difference induces an orientation disparity between the projection of the 
tangent in the left eye and the tangent in the right eye. 
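
The effect can be illustrated with a toy model. The sketch below is not the paper's camera model: it assumes two orthographic (affine) cameras rotated by plus and minus a hypothetical `half_vergence` angle about the vertical axis, takes the plane $z = 0$ as the fixation plane, and uses the standard unit tangent of a circular helix with curvature `kappa` and torsion `tau` about the z axis.

```python
import math


def helix_tangent(kappa, tau, phase):
    """Unit tangent of a z-helix with the given curvature and torsion."""
    n = math.hypot(kappa, tau)
    return (-kappa * math.sin(phase) / n, kappa * math.cos(phase) / n, tau / n)


def image_orientation(t, verge):
    """Orientation of tangent t seen by an orthographic camera rotated by
    `verge` radians about the y axis (image x axis = (cos, 0, -sin))."""
    u = math.cos(verge) * t[0] - math.sin(verge) * t[2]
    v = t[1]
    return math.atan2(v, u)


def orientation_disparity(kappa, tau, phase=0.0, half_vergence=0.1):
    """Left minus right image orientation, wrapped to (-pi, pi]."""
    t = helix_tangent(kappa, tau, phase)
    d = image_orientation(t, +half_vergence) - image_orientation(t, -half_vergence)
    return math.atan2(math.sin(d), math.cos(d))


for tau in (0.0, 0.5, 1.0):
    print(tau, orientation_disparity(kappa=1.0, tau=tau))
```

With zero torsion the tangent has no depth component and both cameras see the same orientation; as the torsion grows the tangent tilts out of the plane and the two projected orientations separate.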

Two technical details remain that are beyond the space limitations of this
paper. First, the choice of the z-helix implies that it can be used as the local
approximation to an arbitrary space curve. Unfortunately, this requires the
Frenet frame of the z-helix to be oriented in the same direction as the curve's
frame. However, it can be shown that, for appropriately selected values of
curvature and torsion, the z-helix is a valid approximation to a curve under the
limits induced by the quantization of position, orientation and curvature in the
image planes. Second, the sampling of the stereo tangent space may introduce
complications because the inverse projection of a quantized stereo pair may not
be defined. However, it can be shown that, for suitable couplings between
position and orientation quantization, a suitable pseudoinverse can be defined
(Fig. 8).



Fig. 8. Top: The positive discrete compatibility field for the point
$i = (0, 0, 0, 0, \pi/2, \pi/2)$ projected in the left and right image planes.
Bottom: The positive discrete compatibility field for the point
$i = (0, 0, 0, 0, \pi/3, \pi/3)$ projected in the left and right image planes.

2.2 The Stereo Relaxation Labeling Process 

We can now use the discrete compatibility fields to select pairs of points in the
stereo tangent space that are most consistent with one another. For motivation it
helps to recall the model for monocular edge consistency via relaxation labeling
with co-circular compatibilities. Abstractly, let a set of nodes (image positions)
be given, and a set of labels (edge orientations) be defined for each position.
Further, let the labels be ordered at each node by a probability measure
$p_i(\lambda)$, indicating the probability that label $\lambda$ is correct for node
$i$, with $\sum_\lambda p_i(\lambda) = 1 \ \forall i$. Labels are "selected" at each
position by an iterative gradient ascent that extremizes the functional
$A(p) = \sum_{i,\lambda} \sum_{j,\lambda'} p_i(\lambda)\, r_{ij}(\lambda, \lambda')\, p_j(\lambda')$
in parallel for all nodes $i$ and all labels $\lambda$. The compatibilities
$r_{ij}(\lambda, \lambda')$ derive from co-circularity [?]. For a full description
see [?]. In effect, the relaxation network selects those edges that, if
transported along the osculating circle locally, would most match their
neighbors.
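
For concreteness, the consistency functional of relaxation labeling (the sum over all node and label pairs of $p_i(\lambda)\, r_{ij}(\lambda, \lambda')\, p_j(\lambda')$) can be evaluated as in the sketch below. The data layout (a list of per-node label distributions and a sparse dictionary of compatibilities) and the function name are assumptions chosen for brevity, and the toy compatibilities are not co-circular ones.

```python
def average_local_consistency(p, r):
    """Sum of p[i][lam] * r[(i, lam, j, lam2)] * p[j][lam2] over all indices.

    p: list of per-node label distributions.
    r: dict mapping (i, lam, j, lam2) to a compatibility; missing keys count as 0.
    """
    total = 0.0
    for i, p_i in enumerate(p):
        for lam, p_il in enumerate(p_i):
            for j, p_j in enumerate(p):
                for lam2, p_jl in enumerate(p_j):
                    total += p_il * r.get((i, lam, j, lam2), 0.0) * p_jl
    return total


# Toy example: two nodes whose label 0 supports label 0 at the other node.
r = {(0, 0, 1, 0): 1.0, (1, 0, 0, 0): 1.0}
ambiguous = [[0.5, 0.5], [0.5, 0.5]]
consistent = [[1.0, 0.0], [1.0, 0.0]]
print(average_local_consistency(ambiguous, r),
      average_local_consistency(consistent, r))
```

The unambiguous, mutually supporting labeling scores higher than the ambiguous one, which is the direction in which gradient ascent moves the labeling.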

A specialization of the above to the two-label relaxation labeling process was
developed to allow multiple labels at a position [?], and it is this specialization
that will be used to disambiguate the stereo tangent pairs. Let $I = \{1, \ldots, n\}$
be a set of nodes. Each node is assigned two labels
$\lambda \in \{\mathrm{TRUE}, \mathrm{FALSE}\}$, where $p_i(\lambda)$ represents the
confidence in the label at node $i$ and
$p_i(\mathrm{TRUE}) + p_i(\mathrm{FALSE}) = 1$. Since it is sufficient to know the
value of $p_i(\mathrm{TRUE})$ to determine $p_i(\mathrm{FALSE})$, only values of
$p_i(\mathrm{TRUE})$ need be updated. For notational convenience, let
$p_i = p_i(\mathrm{TRUE})$. The update rule is:

$$p_i^{t+1} = \left[\, p_i^t + q_i^t \,\right]_0^1, \qquad q_i^t = \sum_j r_{ij}\, p_j^t \qquad (3)$$

where

$$[x]_0^1 = \begin{cases} 1, & \text{if } x > 1 \\ x, & \text{if } 0 \le x \le 1 \\ 0, & \text{if } x < 0 \end{cases}$$

We denote $q_i$ as the support at $p_i$ and $R = \{r_{ij}\}$ as the compatibility matrix.
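
A minimal sketch of such a two-label process follows. It assumes the support is the weighted sum of neighboring confidences under the compatibility matrix and that each confidence is incremented by its support and clamped to $[0, 1]$; the absence of a step-size factor, the function names, and the toy compatibility values are assumptions made for illustration.

```python
def clamp01(x):
    """Project x onto the interval [0, 1]."""
    if x > 1.0:
        return 1.0
    if x < 0.0:
        return 0.0
    return x


def relax_step(p, R):
    """One parallel update: support q_i = sum_j R[i][j] * p[j], then clamp."""
    n = len(p)
    q = [sum(R[i][j] * p[j] for j in range(n)) for i in range(n)]
    return [clamp01(p_i + q_i) for p_i, q_i in zip(p, q)]


def relax(p, R, iterations):
    for _ in range(iterations):
        p = relax_step(p, R)
    return p


# Toy network: nodes 0 and 1 support each other, both inhibit node 2.
R = [[0.0, 0.2, -0.1],
     [0.2, 0.0, -0.1],
     [-0.3, -0.3, 0.0]]
p0 = [0.5, 0.5, 0.5]
print(relax(p0, R, iterations=10))
```

Starting from complete ambiguity, the mutually supporting nodes saturate at TRUE while the inhibited node is driven to FALSE.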

We use the two-label relaxation process to infer the correspondences between
tangents in the left and right tangent fields. Let each kernel point,
$\hat{i} \in \vartheta$, be a node in the relaxation labeling network, where the
hat denotes discretization. Thus, each node in the network encodes absolute
horizontal disparity, vertical disparity, and orientation disparity. Intuitively,
the value of $p_i(\mathrm{TRUE})$ represents the confidence that the stereo tangent
pair $i$ is a correct correspondence. The initial labeling assignment is the set of
all possible correspondences in the image plane. The value of the compatibility
matrix element $r_{ij}$ is the discrete compatibility between the pair $i, j$. As a
consequence, the set of all discrete compatibility fields is commensurate with the
compatibility matrix $R$.
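
One way to picture a node of this network is as a small record from which the three disparities can be read off. The class and field names below are hypothetical, and the convention of left minus right for each disparity is an assumption.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A discretized stereo tangent pair used as a relaxation-labeling node."""
    xl: int
    yl: int
    xr: int
    yr: int
    theta_l: float
    theta_r: float
    confidence: float = 0.5   # p_i(TRUE); the initial value is an assumption

    @property
    def horizontal_disparity(self) -> int:
        return self.xl - self.xr

    @property
    def vertical_disparity(self) -> int:
        return self.yl - self.yr

    @property
    def orientation_disparity(self) -> float:
        """Left minus right orientation, wrapped to (-pi, pi]."""
        d = self.theta_l - self.theta_r
        return math.atan2(math.sin(d), math.cos(d))


node = Node(xl=10, yl=5, xr=7, yr=5, theta_l=math.pi / 2, theta_r=math.pi / 3)
print(node.horizontal_disparity, node.vertical_disparity, node.orientation_disparity)
```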

3 Results 

The first step in our stereo process is to obtain a discrete tangent map
representation of the stereo pair. Figure [?] is the reconstruction of the scene
from the synthetic stereo pair using our approach to correspondence matching
(without curvature information). In those few places where the differential
geometric constraints are inadequate, we select those matches that lie closest to
the fixation plane. The most significant observation with respect to the
reconstruction is that the change in the ordering of primitives, in this case
discrete tangent elements, does not affect the matching algorithm. Two curves, a
y-helix and an arc, are both successfully segmented and rendered at the correct
depth. Another significant difference between the TINA reconstruction and ours is
that our curves are both isolated in space. This difference is due in part to the
failure of the PMF matching algorithm near junctions in the image. We handle such
structures in the discrete tangent map by representing multiple tangents at a
given (image) point. At image junctions, the tangent map encodes two separate
tangent directions, one for the arc and the other for the helix. The multiple
tangents at the junction in the left image participate in the matching process
with the appropriate tangents at the junction in the right image. This ensures
that the parts of the helix in the scene that correspond with the junction areas
in the image have valid matches in the left and right image planes. The same
applies to the arc.

The errors in matching induced by the lack of ordering are further exaggerated
when the helix is farther out in front of the circular arc. This is because the
image extent over which the ordering constraint is invalid increases. Fig. [?]
contains the same curves as the previous stereo pair but exaggerates the parts of
the image where the ordering constraint breaks down. Predictably, the number
of pixels that are falsely matched using the PMF algorithm increases (Fig. [?]).
The reconstruction of the scene as computed by TINA is shown in Fig. 3.

In contrast to the TINA reconstructions, our stereo algorithm returns a more
veridical reconstruction of the original scene (see Fig. [?]). As in the previous
stereo pair, the only constraint used in the computation was a bias for matches
close to the fixation plane. The stereo reconstruction results in two curves, an
arc and a y-helix, that are separated in depth.



Fig. 9. The reconstruction of the stereo scene from the stereo pair in Fig. [?] is shown
from three separate projections. The color of each space tangent represents its depth
(red is closest to the camera and blue the farthest away). Note that the approximate
geometry of the curves in three space is reconstructed and that the curves are localized
at distinct depths. The result is obtained after 5 iterations of relaxation labelling and
$\sigma_c = 1.75$.




Fig. 10. A stereo pair of Asiatic lilies. The cameras had a 9cm baseline. The stereo
pair can be fused using uncrossed fusing.



Fig. 11. The discrete tangent map of the lilies in Fig. 10. The tangent map was created
using logical/linear operators [12] and a relaxation labelling process followed by
a deblurring function (after [1]). The quantization of space and orientation has been
improved by six times in both dimensions.




Fig. 12. The depth map associated with the stereo lily pair. The color bar at the right 
indicates the depth associated with the colors in the image. Each point in the image is 
a tangent in three space that is geometrically consistent with its neighbors. 




Fig. 13. Left: A detail of a flower from the stereo reconstruction. The colored depth
map is shown to the right of the image. The red curve segments are false matches that
are spatially located in front of the flower. The graduated coloring along the petals
indicates a smooth transition along the petals. Right: A cropping plane 1.01m away
from the cyclopean center is inserted to show the details of objects in the foreground.



The gaps in the stereo reconstructions are due mainly to the inability to match
horizontal tangents. Image tangents whose orientations are less than 20° from
the horizontal are not used in the stereo reconstruction. There are some false
matches as shown in the left panel of the above figures.
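
The exclusion of near-horizontal tangents amounts to a simple orientation test, sketched below. The function names are hypothetical, and treating a tangent as an undirected line (so that 170° counts as 10° from the horizontal) is an assumption.

```python
import math


def angle_from_horizontal_deg(theta):
    """Smallest angle, in degrees, between an undirected line of orientation
    theta (radians) and the horizontal axis."""
    d = math.degrees(theta) % 180.0
    return min(d, 180.0 - d)


def usable_for_stereo(theta, min_angle_deg=20.0):
    """Keep only tangents at least `min_angle_deg` away from the horizontal."""
    return angle_from_horizontal_deg(theta) >= min_angle_deg


orientations = [math.radians(a) for a in (10, 25, 90, 170)]
print([usable_for_stereo(t) for t in orientations])
```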

We have shown that a key assumption used in algorithms like PMF is the
ordering constraint. In the synthetic images we showed that if the ordering
constraint is violated then the PMF algorithm performs poorly. In contrast, our
stereo algorithm is relatively unaffected by the lack of ordering in parts of the
image. The stereo pair of twigs in Fig. [?] shows a real world situation where the
ordering constraint breaks down but where our algorithm succeeds (Fig. 14).
While there are false matches associated with the twigs farthest in the
background, they are few in number and are often short in length. They occur in
this implementation because preference is given to matches closest to the
fixation plane, which is in front of the receding twigs.




Fig. 14. Left: The stereo reconstruction of the twigs scene. There are some mismatches
near the base of the twigs and along the branch receding in depth. The reconstruction
is not affected by those areas where the ordering constraint fails. Middle: A
cropping plane 1.03m from the cyclopean point was placed to highlight the twigs in the
foreground. Right: A bird's-eye view of the reconstructed scene. The twigs are arranged
in roughly three layers as can be seen by fusing the stereo pair.



In summary, we have shown how the differential geometry of curves in mo- 
nocular images can be made consistent with (the Frenet estimate of) the spatial 
curve from which they project to define a relaxation labeling algorithm for stereo 
correspondence. In effect we have generalized co-circularity for planar curves to 
co-helicity for spatial curves. Invariance to camera-model uncertainties dictates 
the z-helix from among a family of possible space helices, and a unique feature 
of this correspondence is the natural way that orientation disparity combines 
with positional disparity. Examples illustrate how our system functions even 
in situations when ordering (and other heuristic) constraints break down. The 
evolutionary pressures on early tree-dwelling mammals, in particular early
primates, suggest that uncertainties in reaching for a branch while swinging
across space could not be tolerated. Given the mapping of our stereo tangent
space onto orientation hypercolumns, perhaps our algorithm also implies a
mechanism for stereo correspondence in visual cortex.



Abstract. For grey-value images, it is well accepted that the neighborhood
rather than the pixel carries the geometrical interpretation. Interestingly,
the spatial configuration of the neighborhood is the basis for the perception
of humans. Common practice in color image processing is to use the color
information without considering the spatial structure. We aim at a physical
basis for the local interpretation of color images. We propose a framework
for spatial color measurement, based on the Gaussian scale-space theory. We
consider a Gaussian color model, which inherently uses the spatial and color
information in an integrated model. The framework is well-founded in physics
as well as in measurement science. The framework delivers sound and robust
spatial color invariant features. The usefulness of the proposed measurement
framework is illustrated by edge detection, where edges are discriminated as
shadow, highlight, or object boundary. Other applications of the framework
include color invariant image retrieval and color constant edge detection.



1 Introduction 

There has been a recent revival in the analysis of color in computer vision. This
is mainly due to the common knowledge that more visual information leads to
easier interpretation of the visual scene. A color image is easier to segment than a
grey-valued image since some edges are only visible in the color domain and will
not be detected in the grey-valued image. An area of large interest is searching
for particular objects in images and image-databases, for which color is a feature
with high reach in its data-values and hence high potential for discriminability.
Color can thus be seen as an additional cue in image interpretation. Moreover,
color can be used to extract object reflectance robust to a change in imaging
conditions [?]. Therefore color features are well suited for the description
of an object.

Colors are only defined in terms of human observation. Modern analysis of
color started in colorimetry, where the spectral content of tri-chromatic
stimuli is matched by a human, resulting in the well-known XYZ color matching
functions [?]. However, from the pioneering work of Land [?] we know that
a perceived color does not directly correspond to the spectral content of the
stimulus; there is no one-to-one mapping of spectral content to perceived color.
For example, a colorimetry purist will not consider brown to be a color, but
computer vision practitioners would like to be able to define brown in an image
when searching on colors. Hence, it is not only the spectral energy distribution
that codes color information, but also the spatial configuration of colors. We
aim at a physical basis for the local interpretation of color images.

Common image processing sense tells us that the grey-value of a particular
pixel is not a meaningful entity. The value 42 by itself tells us little about
the meaning of the pixel in its environment. It is the local spatial structure of
an image that has a close geometrical interpretation [?]. Yet representing the
spatial structure of a color image is an unsolved problem.

The theory of scale-space [?] adheres to the fact that observation and
scale are intertwined; a measurement is performed at a certain resolution.
Differentiation is one of the fundamental operations in image processing, and one
which is nicely defined [?] in the context of scale-space. In this paper we discuss
how to represent color as a scalar field embedded in a scale-space paradigm. As
a consequence, the differential geometry framework is extended to the domain
of color images. We demonstrate color invariant edge detectors which are robust
to shadow and highlight boundaries.

The paper is organized as follows. Section 2 considers the embedding of color
in the scale-space paradigm. In Sect. 3 we derive estimators for the parameters
in the scale-space model, and give optimal values for these parameters. The
resulting sensitivity curves are colorimetrically compared with human color
vision. Section 4 demonstrates the usefulness of the presented framework in
physics based vision.

2 Color and Observation Scale 

A spatio-spectral energy distribution is only measurable at a certain spatial
resolution and a certain spectral bandwidth. Hence, physically realizable
measurements inherently imply integration over spectral and spatial dimensions.
The integration reduces the infinite-dimensional Hilbert space of spectra at an
infinitesimally small spatial neighborhood to a limited number of measurements.
As suggested by Koenderink [?], general aperture functions, or Gaussians and
their derivatives, may be used to probe the spatio-spectral energy distribution.
We emphasize that no essentially new color model is proposed here, but rather a
theory of color measurement. The specific choice of color representation is
irrelevant for our purpose. For convenience we first concentrate on the spectral
dimension; later on we show the extension to the spatial domain.

2.1 The Spectral Structure of Color 

From scale space theory we know how to probe a function at a certain scale; the
probe should have a Gaussian shape in order to prevent the creation of extra
details in the function when observed at a higher scale (lower resolution) [?].
As suggested by Koenderink [?], we can probe the spectrum with a Gaussian.
In this section, we consider the Gaussian as a general probe for the measurement
of spatio-spectral differential quotients.

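Formally, let $E(\lambda)$ be the energy distribution of the incident light, where 
$\lambda$ denotes wavelength, and let $G(\lambda_0; \sigma_\lambda)$ be the Gaussian at spectral scale $\sigma_\lambda$ 
positioned at $\lambda_0$. The spectral energy distribution may be approximated by a 
Taylor expansion at $\lambda_0$, 

$$E(\lambda) = E^{\lambda_0} + \lambda E_{\lambda}^{\lambda_0} + \tfrac{1}{2} \lambda^2 E_{\lambda\lambda}^{\lambda_0} + \ldots \tag{1}$$

Measurement of the spectral energy distribution with a Gaussian aperture yields 
a weighted integration over the spectrum. The observed energy in the Gaussian 
color model, at infinitely small spatial resolution, approaches in second order to 

$$\hat{E}^{\sigma_\lambda}(\lambda) = \hat{E}^{\lambda_0,\sigma_\lambda} + \lambda \hat{E}_{\lambda}^{\lambda_0,\sigma_\lambda} + \tfrac{1}{2} \lambda^2 \hat{E}_{\lambda\lambda}^{\lambda_0,\sigma_\lambda} + \ldots \tag{2}$$

where 

$$\hat{E}^{\lambda_0,\sigma_\lambda} = \int E(\lambda)\, G(\lambda; \lambda_0, \sigma_\lambda)\, d\lambda \tag{3}$$

measures the spectral intensity, 

$$\hat{E}_{\lambda}^{\lambda_0,\sigma_\lambda} = \int E(\lambda)\, G_{\lambda}(\lambda; \lambda_0, \sigma_\lambda)\, d\lambda \tag{4}$$

measures the first order spectral derivative, and 

$$\hat{E}_{\lambda\lambda}^{\lambda_0,\sigma_\lambda} = \int E(\lambda)\, G_{\lambda\lambda}(\lambda; \lambda_0, \sigma_\lambda)\, d\lambda \tag{5}$$

measures the second order spectral derivative. Further, $G_\lambda$ and $G_{\lambda\lambda}$ denote 
derivatives of the Gaussian with respect to $\lambda$. Note that, throughout the paper, 
we assume scale normalized Gaussian derivatives to probe the spectral energy 
distribution. 

Definition 1 (Gaussian Color Model). The Gaussian color model measures 
the coefficients $\hat{E}^{\lambda_0,\sigma_\lambda}$, $\hat{E}_{\lambda}^{\lambda_0,\sigma_\lambda}$, $\hat{E}_{\lambda\lambda}^{\lambda_0,\sigma_\lambda}$, ... of the Taylor expansion of the 
Gaussian weighted spectral energy distribution at $\lambda_0$ and scale $\sigma_\lambda$. 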
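
The three spectral measurements of the Gaussian color model can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: it assumes a spectrum sampled on a wavelength grid in nanometres, uses a hand-written trapezoidal rule, and the function name `gaussian_color_measurements` is hypothetical. The derivative kernels are applied literally as written in the integrals above, so a spectrum that rises with wavelength gives a negative first order value; negating the first order kernel gives the convolution convention instead.

```python
import numpy as np

def gaussian_color_measurements(wavelengths, spectrum, lambda0=520.0, sigma=55.0):
    """Return (E, E_l, E_ll): Gaussian-weighted spectral intensity and the
    first and second order spectral derivative measurements."""
    lam = np.asarray(wavelengths, dtype=float)
    e = np.asarray(spectrum, dtype=float)
    t = (lam - lambda0) / sigma
    g = np.exp(-0.5 * t**2) / (sigma * np.sqrt(2.0 * np.pi))
    # Scale-normalised derivatives: sigma * dG/dlambda and sigma^2 * d2G/dlambda2.
    g_l = -t * g
    g_ll = (t**2 - 1.0) * g

    def integrate(f):
        # Trapezoidal rule on a possibly non-uniform grid.
        return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(lam)))

    return integrate(e * g), integrate(e * g_l), integrate(e * g_ll)
```

For a flat spectrum only the intensity term is non-zero, which is a quick sanity check on the kernels.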

One might be tempted to consider a higher, larger than two, order structure 
of the smoothed spectrum. However, the subspace spanned by the human visual 
system is of dimension 3, and hence higher order spectral structure cannot be 
observed by the human visual system. 

2.2 The Spatial Structure of Color 

Introduction of spatial extent in the Gaussian color model yields a local Taylor 
expansion at wavelength $\lambda_0$ and position $x_0$. Each measurement of a 
spatio-spectral energy distribution has a spatial as well as a spectral resolution. The 
measurement is obtained by probing an energy density volume in a three-dimensional 
spatio-spectral space, where the size of the probe is determined by the 
observation scales $\sigma_\lambda$ and $\sigma_x$, see Fig. 1. It is directly clear that we do not separately 
consider spatial scale and spectral scale, but actually probe an energy density 
volume in the 3d spectral-spatial space where the “size” of the volume is specified 
by the observation scales. 



Fig. 1. The probe for spatial color consists of probing the product of the spatial and 
the spectral space with a Gaussian aperture. 



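We can describe the observed spatial-spectral energy density $\hat{E}(\lambda, x)$ of light 
as a Taylor series for which the coefficients are given by the energy convolved 
with Gaussian derivatives: 

$$\hat{E}(\lambda, x) = \hat{E} + \begin{pmatrix} x \\ \lambda \end{pmatrix}^{T} \begin{pmatrix} \hat{E}_{x} \\ \hat{E}_{\lambda} \end{pmatrix} + \frac{1}{2} \begin{pmatrix} x \\ \lambda \end{pmatrix}^{T} \begin{pmatrix} \hat{E}_{xx} & \hat{E}_{x\lambda} \\ \hat{E}_{\lambda x} & \hat{E}_{\lambda\lambda} \end{pmatrix} \begin{pmatrix} x \\ \lambda \end{pmatrix} + \ldots \tag{6}$$

where 

$$\hat{E}_{\lambda^i x^j}(\lambda, x) = E(\lambda, x) * G_{\lambda^i x^j}(\lambda, x; \sigma_\lambda, \sigma_x) . \tag{7}$$

Here, $G_{\lambda^i x^j}(\lambda, x; \sigma_\lambda, \sigma_x)$ are the spatio-spectral probes, or color receptive fields. 
The coefficients of the Taylor expansion of $\hat{E}(\lambda, x)$ represent the local image 
structure completely. Truncation of the Taylor expansion results in an 
approximate representation, optimal in the least squares sense. 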

For human vision, it is known that the Taylor expansion is spectrally 
truncated at second order. Hence, higher order derivatives do not affect color as 
observed by the human visual system. Therefore, three receptive field families 
should be considered: the luminance receptive fields as known from luminance 
scale-space, extended with a yellow-blue receptive field family measuring the 
first order spectral derivative, and a red-green receptive field family probing the 
second order spectral derivative. These receptive field families are illustrated in 
Fig. 2. For human vision, the Taylor expansion for luminance is spatially 
truncated at fourth order. 

Fig. 2. A diagrammatic representation of the various color receptive fields, here 
truncated at second order. The spatial luminance only yields the well-known receptive 
fields from grey-value scale-space theory (column denoted by 0). For color vision, the 
luminance family is extended by a yellow-blue family (column 1) measuring the 
first-order spectral derivatives, and a red-green family (column 2) measuring the 
second-order spectral derivatives. 



3 Colorimetric Analysis of the Gaussian Color Model 

The eye projects the infinitely dimensional spectral density function onto a 3d 
‘color’ space. Not any 3d subspace of the Hilbert space of spectra equals the 
subspace that nature has chosen. Any subspace we create with an artificial color 
model should be reasonably close in some metrical sense to the spectral subspace 
spanned by the human visual system. 

Formally, the infinitely dimensional spectrum $e$ is projected onto a 3d space 
$c$ by $c = A^T e$, where $A^T = (X\ Y\ Z)$ represents the color matching matrix. The 
subspace in which $c$ resides is defined by the color matching functions $A^T$. The 
range $\Re(A^T)$ defines what spectral distributions $e$ can be reached from $c$, and 
the nullspace $\aleph(A^T)$ defines which spectra $e$ cannot be observed in $c$. Since any 
spectrum $e = e_\Re + e_\aleph$ can be decomposed into a part that resides in $\Re(A^T)$ and a part 
that resides in $\aleph(A^T)$, we define 

Definition 2. The observable part of the spectrum equals $e_\Re = \Pi_\Re e$ where $\Pi_\Re$ 
is the projection onto the range of the human color matching functions $A^T$. 

Definition 3. The non-observable (or metameric black) part of the spectrum 
equals $e_\aleph = \Pi_\aleph e$ where $\Pi_\aleph$ is the projection onto the nullspace of the human 
color matching functions $A^T$. 

The projection on the range $\Re(A^T)$ is given by 

$$\Pi_\Re : A^T \mapsto \Re(A^T) = A \left(A^T A\right)^{-1} A^T \tag{8}$$

and the projection on the nullspace 

$$\Pi_\aleph : A^T \mapsto \aleph(A^T) = I - A \left(A^T A\right)^{-1} A^T = \Pi_\Re^{\perp} . \tag{9}$$

Any spectral probe $B^T$ that has the same range as $A^T$ is said to be colorimetric 
with $A^T$ and hence differs only in an affine transformation. An important 
property of the range projector $\Pi_\Re$ is that it uniquely specifies the subspace. Thus, 
we can rephrase the previous statement into: 

Proposition 4. The human color space is uniquely defined by $\Re(A^T)$. Any 
color model $B^T$ is colorimetric with $A^T$ if and only if $\Re(A^T) = \Re(B^T)$. 

In this way we can tell if a certain color model is colorimetric with the 
human visual system. Naturally this is a formal definition. It is not well suited 
for a measurement approach where the color subspaces are measured with a 
given precision. A definition of the difference between subspaces is given in 
the cited reference, Section 2.6.3: 

Proposition 5. The largest principal angle $\theta$ between color subspaces given by 
their color matching functions $A^T$ and $B^T$ equals 

$$\theta(A^T, B^T) = \arcsin\left( \left\| \Re(A^T) - \Re(B^T) \right\|_2 \right) .$$
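
The range projector and the largest principal angle can be checked numerically with a few lines of NumPy. The sketch below is illustrative only: it assumes each set of color matching functions is stored as the columns of an (n, 3) array of spectral samples, that both subspaces have the same dimension, and the function names are hypothetical.

```python
import numpy as np

def range_projector(A):
    """Orthogonal projector onto the column space of A: A (A^T A)^-1 A^T."""
    A = np.asarray(A, dtype=float)
    return A @ np.linalg.solve(A.T @ A, A.T)

def largest_principal_angle(A, B):
    """Largest principal angle (radians) between the column spaces of A and B,
    computed as arcsin of the spectral norm of the projector difference."""
    d = np.linalg.norm(range_projector(A) - range_projector(B), 2)
    # Guard against rounding slightly above 1 before taking arcsin.
    return float(np.arcsin(min(d, 1.0)))
```

Because the projector depends only on the subspace, any invertible linear remixing of the columns of `A` leaves the angle unchanged, which mirrors the statement that colorimetric models differ only by a linear transformation.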



Up to this point we have established expressions describing similarity between 
different subspaces. We are now in a position to compare the subspace of the 
Gaussian color model with the human visual system by using the XYZ color 
matching functions. Hence, parameters for the Gaussian color model may be 
optimized to capture a similar spectral subspace as spanned by human vision, 
see Fig. 3. Let the Gaussian color matching functions be given by $G(\lambda_0, \sigma_\lambda)$. 
We have 2 degrees of freedom in positioning the subspace of the Gaussian color 
model: the mean $\lambda_0$ and scale $\sigma_\lambda$ of the Gaussian. We wish to find the optimal 
subspace that minimizes the largest principal angle between the subspaces, i.e.: 

$$B(\lambda_0, \sigma_\lambda) = \left( G(\lambda; \lambda_0, \sigma_\lambda) \;\; G_{\lambda}(\lambda; \lambda_0, \sigma_\lambda) \;\; G_{\lambda\lambda}(\lambda; \lambda_0, \sigma_\lambda) \right)$$

$$\sin\theta = \operatorname*{argmin}_{\lambda_0, \sigma_\lambda} \left\| \Re(A^T) - \Re\left( B(\lambda_0, \sigma_\lambda)^T \right) \right\|_2$$

An approximate solution is obtained for $\lambda_0 = 520$ nm and $\sigma_\lambda = 55$ nm. The 
corresponding angles between the principal axes of the Gaussian sensitivities and 
the 1931 and 1964 CIE standard observers are given in Tab. 1. Figure 4 shows the 
different sensitivities, together with the optimal (least square) transform from 
the XYZ sensitivities to the Gaussian basis, given by 



$$\begin{pmatrix} \hat{E} \\ \hat{E}_{\lambda} \\ \hat{E}_{\lambda\lambda} \end{pmatrix} = \begin{pmatrix} -0.019 & 0.048 & 0.011 \\ 0.019 & \ldots & -0.016 \\ 0.047 & -0.052 & 0 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} \tag{10}$$



Since the transformed sensitivities are a linear (affine) transformation of the 
original XYZ sensitivities, the transformation is colorimetric with human vision. 
The transform is close to the Hering basis for color vision, for which the 
yellow-blue pathway indeed is found in the visual system of primates [2]. 




Fig. 3. Cohen’s fundamental matrix for the CIE 1964 standard observer, and for the 
Gaussian color model ($\lambda_0 = 520$ nm, $\sigma_\lambda = 55$ nm), respectively. 



Table 1. Angles between the principal axes for various color systems. For determining 
the optimal values $\lambda_0, \sigma_\lambda$, the largest angle $\theta_1$ is minimized. The distance between the 
Gaussian sensitivities for the optimal values $\lambda_0 = 520$ nm, $\sigma_\lambda = 55$ nm and the different 
CIE colorimetric systems is comparable. Note the difference between the CIE systems 
is 9.8°. 

Fig. 4. The Gaussian sensitivities at $\lambda_0 = 520$ nm and $\sigma_\lambda = 55$ nm (left). The best 
linear transformation from the CIE 1964 XYZ sensitivities (middle) to the Gaussian 
basis is shown right. Note the correspondence between the transformed sensitivities 
and the Gaussian color model. 



4 Results 



4.1 The Gaussian Color Model by an RGB-Camera 

An RGB-camera approximates the CIE 1931 XYZ basis for colorimetry by the 
linear transform 

$$\begin{pmatrix} X \\ Y \\ Z \end{pmatrix} = \begin{pmatrix} 0.621 & 0.113 & 0.194 \\ \ldots & \ldots & \ldots \\ -0.009 & 0.027 & 1.105 \end{pmatrix} \begin{pmatrix} R \\ G \\ B \end{pmatrix} \tag{11}$$

The best linear transform from XYZ values to the Gaussian color model is 
given by Eq. (10). A better approximation to the Gaussian color model may be 
obtained for known camera sensitivities. Figure 5 shows an example image and 
its Gaussian color model components. 




Fig. 5. The example image (left) and its color components $\hat{E}$, $\hat{E}_\lambda$, and $\hat{E}_{\lambda\lambda}$, respectively. 
Note that for the color component $\hat{E}_\lambda$ achromaticity is shown in grey, negative bluish 
values are shown in dark, and positive yellowish in light. Further, for $\hat{E}_{\lambda\lambda}$ achromaticity 
is shown in grey, negative greenish in dark, and positive reddish in light. 

4.2 Color Invariant Edge Detection 

An interesting problem in the segmentation of man-made objects is the 
classification of edges into the “real” object edges, or “artificial” edges caused by 
shadow boundaries or highlights. Consider an image captured under white 
illumination. A common model for the reflection of light by an object to the 
camera is given by the Kubelka-Munk theory 

$$E(\lambda, x) = i(x) \left\{ \rho_f(x) + \left(1 - \rho_f(x)\right)^2 R_\infty(\lambda, x) \right\} \tag{12}$$

where $i$ denotes the intensity distribution, $\rho_f$ the Fresnel reflectance at the object 
surface, and $R_\infty$ the object spectral reflectance function. The reflected spectral 
energy distribution in the camera direction is denoted by $E$. The quantities $i$ and 
$\rho_f$ depend on both scene geometry and object properties, whereas $R_\infty$ depends on 
object properties only. Edges may occur under three circumstances: 

- shadow boundaries due to edges in $i(x)$ 

- highlight boundaries due to edges in $\rho_f(x)$ 

- material boundaries due to edges in $R_\infty(\lambda, x)$. 

For the model given by Eq. (12), material edges are detected by considering the 
ratio between the first and second order derivative with respect to $\lambda$, or 

$$\frac{\partial}{\partial x} \left\{ \frac{E_\lambda}{E_{\lambda\lambda}} \right\}$$

where $E$ represents $E(\lambda, x)$ and indices denote differentiation. Further, the 
ratios between $E(\lambda, x)$ and its spectral derivatives are independent of the spatial 
intensity distribution. Hence, the spatial derivatives 

$$\frac{\partial}{\partial x} \left\{ \frac{E_\lambda}{E} \right\} \quad \text{and} \quad \frac{\partial}{\partial x} \left\{ \frac{E_{\lambda\lambda}}{E} \right\}$$

depend on Fresnel and material edges. Finally, the spatial derivatives of $E$, $E_\lambda$, 
and $E_{\lambda\lambda}$ depend on intensity, Fresnel, and material edges. Measurement of these 
expressions is obtained by substitution of $E$, $E_\lambda$, and $E_{\lambda\lambda}$ for the measured 
values $\hat{E}$, $\hat{E}_\lambda$, and $\hat{E}_{\lambda\lambda}$ at scale $\sigma_\lambda$, together with their spatial derivatives. 

Combining these expressions in gradient magnitudes yields Fig. 6. 
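
The three edge classes can be illustrated with a small NumPy sketch. It is only an approximation of the measurement described above: it assumes the three spectral components are already available as 2D arrays, replaces Gaussian spatial derivatives with `np.gradient` finite differences, and combines the terms by a root sum of squares, which is an assumption since the text does not spell out the combination. The name `edge_maps` and the `eps` guard are hypothetical.

```python
import numpy as np

def _grad_sq(img):
    """Squared gradient magnitude using finite differences."""
    gy, gx = np.gradient(img)
    return gx**2 + gy**2

def edge_maps(E, El, Ell, eps=1e-9):
    """Return (material, fresnel_and_material, all_edges) gradient magnitudes
    from the intensity E and its first/second spectral derivatives El, Ell."""
    E, El, Ell = (np.asarray(a, dtype=float) for a in (E, El, Ell))

    def safe(d):
        # Avoid division by (near) zero; eps is an illustrative guard value.
        return np.where(np.abs(d) < eps, eps, d)

    # Ratio El/Ell cancels both intensity and Fresnel terms of the model.
    material = np.sqrt(_grad_sq(El / safe(Ell)))
    # Ratios with E cancel the intensity term only.
    fresnel_and_material = np.sqrt(_grad_sq(El / safe(E)) + _grad_sq(Ell / safe(E)))
    # Plain spatial derivatives respond to every kind of edge.
    all_edges = np.sqrt(_grad_sq(E) + _grad_sq(El) + _grad_sq(Ell))
    return material, fresnel_and_material, all_edges
```

On a synthetic image that follows the Kubelka-Munk model with a uniform material and a pure shadow step, only the third map responds.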

5 Conclusion 

In this paper, we have established the measurement of spatial color information 
from RGB-images, based on the Gaussian scale-space paradigm. We have shown 
that the formation of color images yields a spatio-spectral integration process at 
a certain spatial and spectral resolution. Hence, measurement of color images 
implies probing a three-dimensional energy density at a spatial scale $\sigma_x$ and 
spectral scale $\sigma_\lambda$. The Gaussian aperture may be used to probe the 
spatio-spectral energy distribution. 

Fig. 6. Edge detection in a color image. The left figure shows edges due to object 
reflectance; the second figure includes highlight boundaries, whereas the figure on the 
right also exhibits shadow boundaries. Spatial scale $\sigma_x = 1$ pixel. The image is captured 
by a Sony XC-003P camera, white balanced under office lighting, gamma turned off. 

We have achieved a spatial color model, well founded in physics as well as 
in measurement science. The parameters of the Gaussian color model have been 
estimated such that a similar spectral subspace as human vision is captured. 
The Gaussian color model solves a fundamental problem of color and scale by 
integrating the spatial and color information. The model measures the coeffi- 
cients of the Taylor expansion of the spatio-spectral energy distribution. Hence, 
the Gaussian color model describes the local structure of color images. As a 
consequence, the differential geometry framework is extended to the domain of 
color images. 

Spatial differentiation of expressions derived from the Gaussian color model is 
inherently well-posed, in contrast with often ad-hoc methods for detection of hue 
edges and other color edge detectors. The framework is successfully applied to 
color edge classification, labeling edges as material, shadow, or highlight 
boundaries. Other application areas include physics-based vision, image database 
searches, color constant edge detection, and object tracking. 

Abstract. Statistics-based colour constancy algorithms work well as 
long as there are many colours in a scene; they fail, however, when the 
scenes encountered comprise few surfaces. In contrast, physics-based 
algorithms, based on an understanding of physical processes such as 
highlights and interreflections, are theoretically able to solve for colour 
constancy even when there are as few as two surfaces in a scene. 
Unfortunately, physics-based theories rarely work outside the lab. In this paper we 
show that a combination of physical and statistical knowledge leads to 
a surprisingly simple and powerful colour constancy algorithm, one that 
also works well for images of natural scenes. 

From a physical standpoint we observe that given the dichromatic model 
of image formation the colour signals coming from a single uniformly- 
coloured surface are mapped to a line in chromaticity space. One com- 
ponent of the line is defined by the colour of the illuminant (i.e. specular 
highlights) and the other is due to its matte, or Lambertian, reflectance. 
We then make the statistical observation that the chromaticities of com- 
mon light sources all follow closely the Planckian locus of black-body 
radiators. It follows that by intersecting the dichromatic line with the 
Planckian locus we can estimate the chromaticity of the illumination. 
We can solve for colour constancy even when there is a single surface 
in the scene. When there are many surfaces in a scene the individual 
estimates from each surface are averaged together to improve accuracy. 
In a set of experiments on real images we show our approach delivers very 
good colour constancy. Moreover, performance is significantly better than 
previous dichromatic algorithms. 



1 Introduction 

The sensor responses of a device such as a digital camera depend both on the 
surfaces in a scene and on the prevailing illumination conditions. Hence, a single 
surface viewed under two different illuminations will yield two different sets 
of sensor responses. For humans however, the perceived colour of an object is 
more or less independent of the illuminant; a white paper appears white both 
outdoors under bluish daylight and indoors under yellow tungsten light, though 
the responses of the eyes’ colour receptors, the long-, medium-, and short-wave 
sensitive cones, will be quite different for the two cases. This ability is called 
colour constancy. Researchers in computer vision have long sought algorithms 
to make colour cameras equally colour constant. 

D. Vernon (Ed.): ECCV 2000, LNCS 1842, pp. 342-..., 2000. 
(c) Springer-Verlag Berlin Heidelberg 2000 

Perhaps the most studied physics-based colour constancy algorithms, i.e. 
algorithms which are based on an understanding of how physical processes 
manifest themselves in images, and the ones which show the most (though still 
limited) functionality, are based on the dichromatic reflectance model for 
inhomogeneous dielectrics (proposed by Shafer, Tominaga and Wandell, 
and others). Inhomogeneous materials are composed of more than one 
material with different refractive indices; usually there is a vehicle dielectric 
material with embedded pigment particles. Examples of inhomogeneous 
dielectrics include paints, plastics, and paper. Under the dichromatic model, the light 
reflected from a surface comprises two physically different types of reflection: 
interface or surface reflection, and body or sub-surface reflection. The body part 
models conventional matte surfaces: light enters the surface and is scattered and 
absorbed by the internal pigments; some of the scattered light is then re-emitted 
randomly, thus giving the body reflection a Lambertian character. Interface 
reflection, which models highlights, usually has the same spectral power distribution as 
the illuminant. Because light is additive, the colour signals from inhomogeneous 
dielectrics will then fall on what is called a dichromatic plane spanned by the 
reflectance vectors of the body and the interface part respectively. 

As the specular reflectance represents essentially the illuminant reflectance, 
this illuminant vector is contained in the dichromatic plane of an object. The 
same would obviously be true for a second object. Thus, a simple method for 
achieving colour constancy is to find the intersection of the two dichromatic 
planes. Indeed, this algorithm is well known and has been proposed by several 
authors. When there are more than two dichromatic planes, the best 
common intersection can be found. In a variation on the same theme, 
Lee projects the dichromatic planes into chromaticity space and then 
intersects the resulting dichromatic lines. In the case of more than two surfaces a 
voting technique based on the Hough transform is used. 

Unfortunately dichromatic colour constancy algorithms have not been shown 
to work reliably on natural images. The reasons for this are twofold. First, an 
image must be segmented into regions corresponding to specular objects before 
such algorithms can be employed. We are not too critical about this problem 
since segmentation is, in general, a very hard open research problem. However, 
the second and more serious problem (and the one we address in this paper) is 
that the dichromatic computation is not robust. When the interface and body 
reflectance RGBs are close together (the case for most surfaces) the dichroma- 
tic plane can only be approximately estimated. Moreover, this uncertainty is 
magnified when two planes are intersected. This problem is particularly serious 
when two surfaces have similar colours. In this case the dichromatic planes have 
similar orientations and the recovery error for illuminant estimation is very high. 

In this paper we add statistical knowledge in a new dichromatic colour 
constancy algorithm. Here we model the range of possible illuminants by the 
Planckian locus of black-body radiators. The Planckian locus is a line in colour 
space which curves from yellow indoor lights to whitish outdoor illumination and 
thence to blue sky light. Significantly, experiments show that this locus accounts 
for most natural and man-made light sources. By intersecting a dichromatic 
line with the Planckian locus we recover the chromaticity of the illuminant. By 
definition our algorithm can solve for the illuminant given the image of one 
surface, the lowest colour diversity possible. Moreover, so long as the orientation 
of the dichromatic line is far from that of the illuminant locus the intersection 
should be quite stable. As we shall see, this implies that colours unlikely to be 
illuminants, like greens and purples, will provide an accurate estimate of the 
illumination, whereas the intersection for colours whose dichromatic lines have 
similar orientations to the Planckian locus is more sensitive to noise. However, 
even hard cases lead to relatively small errors in recovery. 

Experiments establish the following results: As predicted by our error model, 
estimation accuracy does depend on surface colour: green and purple surfaces 
work best. However, even for challenging surfaces (yellows and blues) the method 
still gives reasonable results. Experiments on real images of a green plant (easy 
case) and a Caucasian face (hard case) demonstrate that we can recover very 
accurate estimates for real scenes. Indeed, recovery is good enough to support 
pleasing image reproduction (i.e. removing a colour cast due to illumination). 
Experiments also demonstrate that an average illuminant estimate calculated 
by averaging the estimates made for individual surfaces leads in general to very 
good recovery. On average, scenes with as few as 6 surfaces lead to excellent 
recovery. In contrast, traditional dichromatic algorithms based on finding the 
best common intersection of many planes perform much more poorly. Even when 
more than 20 surfaces are present, recovery performance is still not very good 
(not good enough to support image reproduction). 

The rest of the paper is organised as follows. Section 2 provides a brief review 
of colour image formation, the dichromatic reflection model, and the statistical 
distribution of likely illuminants. Section 3 describes the new algorithm in detail. 
Section 4 gives experimental results, while Section 5 concludes the paper. 



2 Background 

2.1 Image Formation 

An image taken with a linear device such as a digital colour camera is composed 
of sensor responses that can be described by 

$$p = \int_{\omega} C(\lambda)\, R(\lambda)\, d\lambda \tag{1}$$

where $\lambda$ is wavelength, $p$ is a 3-vector of sensor responses (RGB pixel values), 
$C$ is the colour signal (the light reflected from an object), and $R$ is the 3-vector 
of sensitivity functions of the device. Integration is performed over the visible 
spectrum $\omega$. 

The colour signal $C(\lambda)$ itself depends on both the surface reflectance $S(\lambda)$ and 
the spectral power distribution $E(\lambda)$ of the illumination. For pure Lambertian 
(matte) surfaces $C(\lambda)$ is proportional to the product $S(\lambda)E(\lambda)$ and its magnitude 
depends on the angle(s) between the surface normal and the light direction(s). 
The brightness of Lambertian surfaces is independent of the viewing direction. 

2.2 Dichromatic Reflection Model 

In the real world, however, most objects are non-Lambertian, and so have some 
glossy or highlight component. The combination of matte reflectance together 
with a geometry dependent highlight component is modeled by the dichromatic 
reflectance model. 

The dichromatic reflection model for inhomogeneous dielectric objects 
states that the colour signal is composed of two additive components, one being 
associated with the interface reflectance and the other describing the body (or 
Lambertian) reflectance part. Both of these components can further be 
decomposed into a term describing the spectral power distribution of the reflectance 
and a scale factor depending on the geometry. This can be expressed as 

$$C(\theta, \lambda) = m_I(\theta)\, C_I(\lambda) + m_B(\theta)\, C_B(\lambda) \tag{2}$$

where $C_I(\lambda)$ and $C_B(\lambda)$ are the spectral power distributions of the interface and 
the body reflectance respectively, and $m_I$ and $m_B$ are the corresponding weight 
factors depending on the geometry $\theta$, which includes the incident angle of the 
light, the viewing angle and the phase angle. 

Equation (2) shows that the colour signal can be expressed as the weighted 
sum of the two reflectance components. Thus the colour signals for an object are 
restricted to a plane. 

Making the roles of light and surface explicit, Equation (2) can be further 
expanded to 

$$C(\theta, \lambda) = m_I(\theta)\, S_I(\lambda)\, E(\lambda) + m_B(\theta)\, S_B(\lambda)\, E(\lambda) \tag{3}$$

Since for many materials the index of refraction does not change significantly 
over the visible spectrum it can be assumed to be constant. $S_I(\lambda)$ is thus a 
constant and Equation (3) becomes: 

$$C(\theta, \lambda) = m_I'(\theta)\, E(\lambda) + m_B(\theta)\, S_B(\lambda)\, E(\lambda) \tag{4}$$

where $m_I'$ now describes both the geometry dependent weighting factor and the 
constant reflectance of the interface term. 

By substituting equation (4) into equation (1) we get the device’s responses 
for dichromatic reflectances: 

$$p = \int_{\omega} m_I'(\theta)\, E(\lambda)\, R(\lambda)\, d\lambda + \int_{\omega} m_B(\theta)\, S_B(\lambda)\, E(\lambda)\, R(\lambda)\, d\lambda \tag{5}$$
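
which we rewrite as 

$$\begin{pmatrix} R \\ G \\ B \end{pmatrix} = m_I'(\theta) \begin{pmatrix} R \\ G \\ B \end{pmatrix}_{I} + m_B(\theta) \begin{pmatrix} R \\ G \\ B \end{pmatrix}_{B} \tag{6}$$

where R, G, and B are the red, green, and blue pixel value outputs of the digital 
camera. Because the RGB of the interface reflectance is equal to the RGB of the 
illuminant E we rewrite (6) making this observation explicit: 

$$\begin{pmatrix} R \\ G \\ B \end{pmatrix} = m_I'(\theta) \begin{pmatrix} R \\ G \\ B \end{pmatrix}_{E} + m_B(\theta) \begin{pmatrix} R \\ G \\ B \end{pmatrix}_{B} \tag{7}$$

Usually chromaticities (RGB values normalised by R + G + B) are written as the 
two dimensional coordinates (r, g) since b = 1 - r - g. Clearly, given (7) we can write: 

$$\begin{pmatrix} r \\ g \end{pmatrix} = m_I'(\theta) \begin{pmatrix} r \\ g \end{pmatrix}_{E} + m_B(\theta) \begin{pmatrix} r \\ g \end{pmatrix}_{B} \tag{8}$$

That is, each surface spans a dichromatic line in chromaticity space. 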



2.3 Dichromatic Colour Constancy 

Equation (7) shows that the RGBs for a surface lie on a two-dimensional plane, 
one component of which is the RGB of the illuminant. If we consider two objects 
within the same scene (and assume that the illumination is constant across the 
scene) then we end up with two RGB planes. Both planes however contain the 
same illuminant RGB. This implies that their intersection must be the illuminant 
itself. Indeed, this is the essence of dichromatic colour constancy. 
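
The plane intersection idea can be sketched in a few lines of NumPy. This is an illustrative reading of the classical two-surface method rather than any specific published implementation: it assumes each surface is given as an (N, 3) array of linear RGB values with at least three pixels, fits each plane through the origin by SVD, and returns only the illuminant chromaticity. The function names are hypothetical.

```python
import numpy as np

def plane_normal(rgbs):
    """Unit normal of the least-squares plane through the origin
    for an (N, 3) array of RGB values, N >= 3."""
    _, _, vt = np.linalg.svd(np.asarray(rgbs, dtype=float), full_matrices=False)
    # The right singular vector with the smallest singular value is the normal.
    return vt[-1]

def illuminant_chromaticity(rgbs1, rgbs2):
    """Chromaticity (r, g, b) of the direction shared by both dichromatic planes."""
    direction = np.cross(plane_normal(rgbs1), plane_normal(rgbs2))
    # Normalising by the channel sum removes the arbitrary sign and scale.
    return direction / direction.sum()
```

When the two planes are nearly parallel the cross product becomes very short, which is one way to see why the estimate is sensitive to noise for similarly coloured surfaces.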

Notice, however, that the plane intersection is unique up to an unknown 
scaling. We can recover the chromaticity of the illuminant but not its magnitude. 
This result can be obtained directly by intersecting the two dichromatic 
chromaticity lines, defined in Equation (8), associated with each surface. Indeed, many 
dichromatic algorithms and most colour constancy algorithms generally solve for 
illuminant colour in chromaticity space (no colour constancy algorithm to date 
can reliably recover light brightness; indeed, most make no attempt to do so^1). 

Though theoretically sound, dichromatic colour constancy algorithms only 
perform well under idealised conditions. For real images the estimate of the 
illuminant turns out not to be that accurate. The reason for this is that in the 
presence of a small amount of image noise the intersection of two dichromatic 
lines can change quite drastically, depending on the orientations of the 
dichromatic lines. This is illustrated in Figure 1, where a ‘good’ intersection for surfaces 
having orientations that are nearly orthogonal to each other is shown as well as 
an inaccurate illuminant estimation due to a large shift in the intersection point 
caused by noise for lines with similar orientations. Hence dichromatic colour 
constancy tends to work well for highly saturated surfaces taken under laboratory 
conditions but much less well for real images (say of typical outdoor natural 
scenes). In fact, the authors know of no dichromatic algorithm which works well 
for natural scenes. 

^1 Because $E(\lambda)S(\lambda) = \frac{E(\lambda)}{k}\, k S(\lambda)$ it is in fact impossible to distinguish between a 
bright light illuminating a dim surface and the converse. So, the magnitude of $E(\lambda)$ 
is not usually recoverable. 



2.4 Distribution of Common Illuminants 

In practice the number and range of light sources is limited. Although it is 
physically possible to manufacture, let's say, a purple light, it practically does not 
exist in nature. As a consequence, if we want to solve for colour constancy we 
would do well not to consider these essentially impossible solutions. Rather we 
can restrict our search to a range of likely illuminants based on our statistical 
knowledge about them. Indeed, this way of constraining the possible estimates 
produces the best colour constancy algorithms to date. We also point out 
that for implausible illuminants, like purple lights, human observers do not have 
good colour constancy. 

If we look at the distribution of typical light sources more closely, then we find 
that they occupy a highly restricted region of colour space. To illustrate this, 
we took 172 measured light sources, including common daylights and 
fluorescents, and 100 measurements of illumination reported in the literature, and 
plotted them in Figure 2 on the xy chromaticity diagram^2. It is clear that the illuminant 
chromaticities fall on a long thin ‘band’ in chromaticity space. 

Also displayed in Figure 2 is the Planckian locus of black-body radiators 
defined by Planck’s formula: 

$$M_e = \frac{c_1 \lambda^{-5}}{e^{c_2 / (\lambda T)} - 1} \tag{9}$$

where $M_e$ is the spectral concentration of radiant exitance, in watts per square 
meter per wavelength interval, as a function of wavelength $\lambda$ and temperature 
$T$ in kelvins. $c_1$ and $c_2$ are constants and equal to $3.74183 \times 10^{-16}$ W m$^2$ and 
$1.4388 \times 10^{-2}$ m K respectively. 

Planck’s black-body formula accurately models the light emitted from metals, 
such as Tungsten, heated to high temperature. Importantly, the formula also 
predicts the general shape (though not the detail) of daylight illuminations. 
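
Planck's formula is easy to evaluate directly. The sketch below assumes wavelengths in metres and temperature in kelvins, and uses the standard values of the two radiation constants; the function name `planck_exitance` is illustrative only.

```python
import numpy as np

C1 = 3.74183e-16  # first radiation constant, W m^2
C2 = 1.4388e-2    # second radiation constant, m K

def planck_exitance(wavelength_m, temperature_k):
    """Spectral radiant exitance of a black body per unit wavelength."""
    lam = np.asarray(wavelength_m, dtype=float)
    # expm1(x) = exp(x) - 1, numerically stable for small x.
    return C1 * lam**-5 / np.expm1(C2 / (lam * temperature_k))
```

Evaluating the function at a low temperature such as a tungsten lamp gives a spectrum weighted towards long wavelengths, while high temperatures shift the weight towards the blue, which is the yellow-to-blue progression along the Planckian locus.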

3 Constrained Dichromatic Colour Constancy 

The colour constancy algorithm proposed in this paper is again based on the fact 
that many objects exhibit highlights and that the colour signals coming from 
those objects can be described by the dichromatic reflection model. However, 
in contrast to the dichromatic algorithms described above and all other colour 
constancy algorithms (save that of Yuille, which works only under the severest 
of constraints), it can estimate the illuminant even when there is just a single 
surface in the scene. Moreover, rather than being an interesting curiosity, this 
single surface constancy behaviour actually represents a significant improvement 
over previous algorithms. 

^2 The xy chromaticity diagram is like the rg diagram but x and y are red and green 
responses of the standard human observer used in colour measurement. 

Constrained dichromatic colour constancy proceeds in two simple steps. First, 
the dichromatic line for a single surface is calculated. This can be done e.g. by 
singular value decomposition or by robust line fitting techniques. In the 
second step the dichromatic line is intersected with the Planckian locus: the 
intersection then defines the illuminant estimate. 
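
The first step, fitting the dichromatic line, might look as follows with an SVD-based total least squares fit. This is a sketch under assumptions: pixels are linear RGB values from one segmented surface, chromaticities are computed as r = R/(R+G+B) and g = G/(R+G+B), and the function names are hypothetical. A robust fit would be preferable for noisy real images.

```python
import numpy as np

def rg_chromaticities(rgb):
    """Map an (N, 3) array of RGB values to (N, 2) rg chromaticities."""
    rgb = np.asarray(rgb, dtype=float)
    return rgb[:, :2] / rgb.sum(axis=1, keepdims=True)

def fit_dichromatic_line(rgb):
    """Return (point, direction) of the total least squares line through
    the rg chromaticities of a single surface."""
    c = rg_chromaticities(rgb)
    mean = c.mean(axis=0)
    # The first right singular vector of the centred data is the line direction.
    _, _, vt = np.linalg.svd(c - mean, full_matrices=False)
    return mean, vt[0]
```

For noise-free dichromatic pixels the illuminant chromaticity lies exactly on the fitted line, since the chromaticity of a sum of two colours is a weighted average of their chromaticities.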

As the Planckian locus (which can accurately be approximated by either a 
2nd degree polynomial or by the daylight locus) is not a straight line, there 
arise three possibilities in terms of its intersection with a dichromatic line, all of 
which can be solved for analytically. First, and in the usual case, there will exist 
one intersection point which then defines the illuminant estimate. However, if the 
orientation of the dichromatic line is similar to that of the illuminant locus, then 
we might end up with two intersections (similar to Yuille’s algorithm), which 
represents the second possibility. In this case, several options need to be considered 
in order to arrive at a unique answer. If we have prior knowledge about the scene 
or its domain we might be able to easily discard one of the solutions, especially 
if they are far apart from each other. For example, for face images, a statistical 
model of skin colour distributions could be used to identify the correct answer. 
If we have no means of extracting the right intersection, we could take the 
mean of both intersections, which will still give a good estimate when the two 
intersections are relatively close to each other. Alternatively we could look at 
the distribution of colour signals of the surface: if those cross the Planckian 
locus from one side then we can choose the intersection on the opposite side, 
as a dichromatic line will only comprise colour signals between the body and 
the interface reflectance. Finally, the third possibility is that the dichromatic 
line does not intersect at all with the Planckian locus. This means that the 
orientation of the dichromatic line was changed due to image noise. However, 
we can still solve for the point which is closest to the illuminant locus, and it is 
clear that there does exist a unique point in such a case. 
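
The three cases can be illustrated with a short sketch. It assumes the locus is modelled as a quadratic g = a r^2 + b r + c over a limited range of r; the coefficients, the range and the function name `intersect_with_locus` are hypothetical and are not fitted to any real Planckian locus. When the line misses the locus, the sketch returns the locus point nearest the line, which is one possible reading of the "closest point" described above.

```python
import math


def intersect_with_locus(p0, d, a, b, c, r_range):
    """Intersect the line p0 + t*d with the locus g = a*r**2 + b*r + c (a != 0).

    Only the part of the locus with r inside r_range counts. Returns
    (case, points) with case 'one', 'two' or 'none'; for 'none' the single
    point returned is the locus point nearest the line.
    """
    def f(r):
        return a * r * r + b * r + c

    lo, hi = r_range
    if abs(d[0]) < 1e-12:                      # vertical line: r is fixed
        return "one", [(p0[0], f(p0[0]))]      # (range check omitted for brevity)
    m = d[1] / d[0]                            # slope of the dichromatic line
    k = p0[1] - m * p0[0]                      # its intercept
    B, C = b - m, c - k                        # a*r^2 + B*r + C = 0
    disc = B * B - 4.0 * a * C
    roots = []
    if disc >= 0.0:
        s = math.sqrt(disc)
        roots = sorted({(-B - s) / (2.0 * a), (-B + s) / (2.0 * a)})
        roots = [r for r in roots if lo <= r <= hi]
    if roots:
        case = "one" if len(roots) == 1 else "two"
        return case, [(r, f(r)) for r in roots]
    # No intersection: the gap |f(r) - (m*r + k)| is smallest either where the
    # locus tangent is parallel to the line or at an end of the range.
    candidates = [lo, hi]
    r_star = (m - b) / (2.0 * a)
    if lo <= r_star <= hi:
        candidates.append(r_star)
    r = min(candidates, key=lambda r: abs(f(r) - (m * r + k)))
    return "none", [(r, f(r))]


LOCUS = (-1.0, 1.0, 0.1)     # hypothetical a, b, c
RANGE = (0.2, 0.6)           # hypothetical extent of the locus in r
case_one = intersect_with_locus((0.3, 0.6), (0.1, -0.5), *LOCUS, RANGE)
case_two = intersect_with_locus((0.0, 0.345), (1.0, 0.0), *LOCUS, RANGE)
case_none = intersect_with_locus((0.0, 0.4), (1.0, 0.0), *LOCUS, RANGE)
```

In the two-intersection case the caller still has to choose between the candidates, for instance with prior knowledge or by taking their mean as discussed above.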

As we see, in each case we are able to find a unique illuminant estimate and 
so constrained dichromatic colour constancy based on only a single surface is 
indeed possible. 

In proposing this algorithm we are well aware that its performance will be 
surface dependent. By looking at the chromaticity plot in Figure 3 we can qualitatively 
predict which surface colours will lead to good estimates. For green 
colours the dichromatic line and the Planckian locus are approximately orthogo- 
nal to each other, hence the intersection should give a very reliable estimate of 
the illuminant where a small amount of noise will not affect the intersection. The 
same will be true for surface colours such as magentas and purples. However, 
when colours are in the yellow or blue regions the orientation of the line is similar 
to that of the locus itself, and hence the intersection is very sensitive to noise. 
In order to provide some quantitative measurement of the expected accuracy of 
our algorithm we have developed error models which we introduce in the next 
section. 

3.1 Error Surfaces for Constrained Dichromatic Colour Constancy 

We define a dichromatic line in a linear RGB colour space based on a camera 
with sRGB sensitivities by its two end points, one representing the pure body 
reflection and the other being the specular part based on a certain illuminant. 
Gaussian noise of variance 0.005 and mean 0 (chromaticities must fall in the 
interval [0,1]) is added to both points before they are projected into chromati- 
city space where their line is intersected with the Planckian locus. For realism 
we restrict ourselves to colour temperatures between 2000 and 10000 kelvin. Within this 
range we find all typical man-made and natural illuminants. As the temperature 
increases from 2000K to 10000K, illuminants progress from orange to yellow 
to white to blue. The intersection point calculated is converted to the corre- 
sponding black-body temperature. The distance between the estimated and the 
actual illuminant, the difference in temperature, then defines the error for that 
specific body reflectance colour. In order to get a statistical quantity, this expe- 
riment was repeated 100 times for each body colour and then averaged to yield a 
single predicted error. The whole procedure has to be carried out for each point 
in chromaticity space to produce a complete error surface. 
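
The simulation for a single body colour can be sketched as below. Several simplifications are assumptions of this sketch and not part of the procedure described above: the camera is modelled by three hypothetical narrow-band sensors (`BANDS`) instead of sRGB sensitivities, the locus is sampled every 50 K, and the "intersection" is taken to be the sampled locus point nearest the noisy line, so the two-intersection case is not resolved explicitly. All function names are illustrative.

```python
import math
import random

C2 = 1.4388e-2                      # second radiation constant (m K)
BANDS = (610e-9, 540e-9, 450e-9)    # hypothetical narrow-band R, G, B sensors (m)
TEMPS = [2000 + 50 * i for i in range(161)]   # 2000 K ... 10000 K


def planck_rgb(T):
    """Relative RGB of a black body at T kelvin, scaled so the largest channel is 1."""
    rgb = [lam ** -5 / (math.exp(C2 / (lam * T)) - 1.0) for lam in BANDS]
    top = max(rgb)
    return [v / top for v in rgb]


def to_rg(rgb):
    """Project an RGB triple to rg chromaticity."""
    s = sum(rgb)
    return rgb[0] / s, rgb[1] / s


LOCUS = [to_rg(planck_rgb(T)) for T in TEMPS]


def estimate_temperature(body_rgb, spec_rgb):
    """Temperature of the sampled locus point nearest the line through both end points."""
    p1, p2 = to_rg(body_rgb), to_rg(spec_rgb)
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    norm = math.hypot(dx, dy) or 1e-12

    def dist(q):
        return abs(dx * (q[1] - p1[1]) - dy * (q[0] - p1[0])) / norm

    best = min(range(len(TEMPS)), key=lambda i: dist(LOCUS[i]))
    return TEMPS[best]


def predicted_error(body_rgb, true_T, trials=100, noise_var=0.005, seed=0):
    """Mean absolute temperature error over noisy repetitions for one body colour."""
    rng = random.Random(seed)
    sigma = math.sqrt(noise_var)
    spec_rgb = planck_rgb(true_T)

    def noisy(rgb):
        # Add zero-mean Gaussian noise and keep each channel inside (0, 1].
        return [min(1.0, max(1e-6, v + rng.gauss(0.0, sigma))) for v in rgb]

    total = 0.0
    for _ in range(trials):
        total += abs(estimate_temperature(noisy(body_rgb), noisy(spec_rgb)) - true_T)
    return total / trials


green_error = predicted_error([0.2, 0.7, 0.2], 6500)   # hypothetical green body colour
```

Calling `predicted_error` for body colours on a grid of chromaticities would give an error surface under these assumptions.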

Figure 4 shows such an error surface generated based on a 6500K bluish light 
source. (Further models for other lights are given elsewhere.) As expected, 
our previous observations are verified here. Greens, purples, and magentas give 
good illuminant estimates, while for colours close to the Planckian locus such as 
yellows and blues the error is highest (look at Figure 1 to see where particular 
surface colours are mapped to in the chromaticity diagram). The error surface 
also demonstrates that in general we will get better results from colours with 
higher saturation, which is also what we expect for dichromatic algorithms. No- 
tice that performance for white surfaces is also very good. This is to be expected 
since if one is on the locus at the correct estimate then all lines must go through 
the correct answer. Good estimation for white is important since, statistically, 
achromatic colours are more likely than saturated colours. In contrast the con- 
ventional dichromatic algorithm fails for the achromatic case: the dichromatic 
plane collapses to a line and so no intersection can be found. 

3.2 Integration of Multiple Surfaces 

Even though our algorithm is designed by definition for single surfaces, averaging 
estimates from multiple surfaces is expected to improve algorithm accuracy. 
Notice, however, that we cannot average in chromaticity space as the Planckian 
locus is non-linear (and so a pointwise average might fall off the locus). Rather, 
each intersection is coded by its temperature and the average temperature 
is computed. Given the temperature and Planck’s formula it is easy to generate 
the corresponding chromaticity. 
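
A minimal sketch of this averaging step follows. It assumes three hypothetical narrow-band sensors (`BANDS`) in place of real camera sensitivities, so the chromaticities it produces are illustrative only; the per-surface temperatures are made-up inputs.

```python
import math

C2 = 1.4388e-2                      # second radiation constant (m K)
BANDS = (610e-9, 540e-9, 450e-9)    # hypothetical narrow-band R, G, B sensors (m)


def planck_rg(T):
    """rg chromaticity of a black body at T kelvin under the assumed sensors."""
    rgb = [lam ** -5 / (math.exp(C2 / (lam * T)) - 1.0) for lam in BANDS]
    s = sum(rgb)
    return rgb[0] / s, rgb[1] / s


def combine_estimates(temperatures):
    """Average per-surface estimates in temperature, then map back to the locus."""
    mean_T = sum(temperatures) / len(temperatures)
    return mean_T, planck_rg(mean_T)


mean_T, chroma = combine_estimates([5000.0, 6500.0, 8000.0])   # hypothetical estimates

# For comparison: a pointwise average of two locus chromaticities.
lo_rg, hi_rg = planck_rg(2000.0), planck_rg(10000.0)
pointwise = ((lo_rg[0] + hi_rg[0]) / 2.0, (lo_rg[1] + hi_rg[1]) / 2.0)
```

Because the result is generated from a temperature, it lies on the locus by construction, whereas `pointwise` need not.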



4 Experimental Results 

In our first experiment we measured, using a spectroradiometer, the light reflected 
from the 24 reflectances on a Macbeth colour chart (a standard reference 
chart with 24 surfaces) viewed under 6500K (bluish light) and at two orientations. 
The resulting spectra were projected, using Equation (1), onto sRGB 
camera curves to generate 48 RGBs. Calculating RGBs in this way (rather than 
using a camera directly) ensures that Equation (1) holds exactly: we can consider 
the algorithm’s performance independent of confounding features (e.g. that a 
camera is only approximately linear). Each pair of RGBs (one pair per surface) defines 
a dichromatic line. Intersecting each dichromatic line with the Planckian locus 
returns an illuminant estimate. Taking the 24 estimates of the illuminant we can plot 
the recovery error as a function of chromaticity, as done in Figure 5, and compare these 
errors with those predicted by our error model (Figure 4). The resemblance of the two 
error distributions is evident. 

From the distribution of errors, a green surface should lead to excellent recovery 
and a pink surface, close to the Planckian locus, should be more difficult to 
deal with. We tested our algorithm using real camera images of a green plant viewed 
under a 6500K bluish daylight simulator, under fluorescent TL84 (≈5000K, 
whitish), and under tungsten light (≈2800K, yellowish). The best fitting dichromatic 
RGB planes were projected to lines in rg chromaticity space where 
they were intersected with the Planckian locus. The plant images together with 
the resulting intersections are shown in Figures 6, 7 and 8. It can be seen that in 
each case the intersection point is indeed very close to the actual illuminant. A 
pink face image provides a harder challenge for our algorithm since skin colour 
lies close to the Planckian locus (it is desaturated pink). However, illuminant 
estimation from faces is a particularly valuable task because skin colour changes 
dramatically with changing illumination and so complicates face recognition 
and tracking. In Figure 9 a face viewed under 4200K is shown together with 
a plot showing the dichromatic intersection that yields the estimated illuminant. 
Again, our algorithm manages to provide an estimate close to the actual illuminant: 
the error in terms of correlated colour temperature difference is 
only 200 K. 

As we have outlined in Section 3.2, combining cues from multiple surfaces will 
improve the accuracy of the illuminant estimate. The task of our final experiment 
was to verify this and to determine the minimum number of surfaces which leads 
to sufficient colour constancy. We took images of the Macbeth colour checker 
at two orientations under 3 lights (tungsten (2800K, yellow), TL84 (5000K, 
whitish) and D65 (6500K, bluish)). We then randomly selected between 1 and 
24 patches from the checker chart and ran our algorithm for each of the 3 lights. 
In order to get meaningful statistical values we repeated this procedure many times. 
The result of this experiment is shown in Figure 10, which demonstrates that the 
curve representing the average error (expressed in K) drops significantly already 
for a combination of only a few surfaces. In Figure 10 we have also drawn a 
line at about 500 kelvin. Experiments have shown that a recovery error of 500 
K suffices as an acceptability threshold for the case of image reproduction (more 
details are given elsewhere). So, visually acceptable estimation is achieved with as few 
as 6 surfaces. 

Finally, we wanted to compare this performance to that of traditional colour 
constancy algorithms. We implemented Lee’s method for intersecting multiple 
dichromatic planes in chromaticity space. Figure 11 summarizes estimation 
error as a function of the number of surfaces. Notice that the conventional dichro- 
matic algorithm fails to reach the acceptability threshold. Performance is com- 
paratively much worse than for our new algorithm. 

5 Conclusion 

In this paper we have developed a new algorithm for colour constancy which is 
(rather remarkably) able to provide an illuminant estimate for a scene contai- 
ning a single surface. Our algorithm combines a physics-based model of image 
formation, the dichromatic reflection model, with a constraint on the illumi- 
nants that models all possible light sources as lying on the Planckian locus. The 
dichromatic model predicts that the colour signals from a single inhomogeneous 
dielectric all fall on a line in chromaticity space. Statistically we observe that 
almost all natural and man made illuminants fall close to the Planckian locus 
(a curved line from blue to yellow in chromaticity space). The intersection of 
the dichromatic line with the Planckian locus gives our illuminant estimate. An 
error model (validated by experiment) shows that some surfaces lead to better 
recovery than others. Surfaces far from and orthogonal to the Planckian locus 
(e.g. greens and purples) work best. However, even ‘hard’ surfaces (e.g. pink 
faces) lead to reasonable estimation. If many surfaces are present in a scene then 
the average estimate can be used. 

Experiments on real images of a green plant (easy case) and a Caucasian 
face (hard case) demonstrate that we can recover very accurate estimates for 
real scenes. Recovery is good enough to support pleasing image reproduction 
(i.e. removing a colour cast due to illumination). Experiments also demonstrate 
that an average illuminant estimate calculated by averaging the estimates made 
for individual surfaces leads in general to very good recovery. On average, sce- 
nes with as few as 6 surfaces lead to excellent results. In contrast, traditional 
dichromatic algorithms based on finding the best common intersection of many 
planes perform much more poorly. Even when more than 20 surfaces are present, 
recovery performance is still not very good (not good enough to support image 
reproduction). 

Fig. 1. Inaccuracy due to image noise in the illuminant estimation by dichromatic 
colour constancy. Only when the orientations of the dichromatic planes (here projected 
to dichromatic lines in xy chromaticity space) are far from each other (purple lines) is the 
solution not much affected by noise. If, however, their orientations are similar, 
even a small amount of noise can lead to a big shift in the intersection point (green lines). 
The red asterisk represents the chromaticity of the actual illuminant. Also shown is the 
location of certain colours (Green, Yellow, Red, Purple, and Blue) in the chromaticity 
diagram. 




Fig. 2. Distribution of 172 measured illuminants and the Planckian locus (red line) 
plotted in xy chromaticity space. It can be seen that the illuminants are clustered 
tightly around the Planckian locus. 

Fig. 3. Intersection of dichromatic lines with the Planckian locus for a green (or purple) 
object (green line) and for a blue (or yellow) object (blue line). While the orientations 
for green/purple objects are approximately orthogonal to that of the Planckian locus, 
for blue/yellow objects the dichromatic lines have similar orientations to the locus. 




Fig. 4. Error surface generated for D65 illuminant and displayed as 3D mesh (left) and 
2D intensity (bottom right) image where white corresponds to the highest and black 
to the lowest error. The diagram at the top right shows the location of the Planckian 
locus (blue line) and the illuminant (red asterisk). 

Fig. 5. Error surface obtained from spectral measurements of the Macbeth 
Checker Chart (left as 3D mesh, right as intensity image). Values outside the range of 
the Macbeth chart were set to 0 (black). 




Fig. 6. Image of a green plant captured under D65 (left) and result of the intersection 
giving the illuminant estimate (right). The real illuminant is plotted as the red asterisk. 
The blue asterisks show the distribution of the colour signals. 

Fig. 7. Image of a green plant captured under TL84 (left) and result of the intersection 
giving the illuminant estimate (right). The real illuminant is plotted as the red asterisk. 
The blue asterisks show the distribution of the colour signals. 




Fig. 8. Image of a green plant captured under Illuminant A (left) and result of the 
intersection giving the illuminant estimate (right). The real illuminant is plotted as 
the red asterisk. The blue asterisks show the distribution of the colour signals. 

Fig. 9. Image of a face captured under light with correlated colour temperature of 4200 
K (left) and result of the intersection giving the illuminant estimate (right). The real 
illuminant is plotted as the red asterisk. The blue asterisks show the distribution of 
the colour signals. 




Fig. 10. Performance of finding a solution by combining multiple surfaces. The average 
estimation error expressed in kelvins is plotted as a function of the number of surfaces 
used. The horizontal line represents the acceptability threshold at 500 K. It can be seen 
that this threshold is reached with as few as about six surfaces. 

Fig. 11. Performance of traditional dichromatic colour constancy compared to our new 
approach. The average estimation error expressed in kelvins as a function of the number 
of surfaces used. The blue line corresponds to traditional dichromatic colour constancy, 
the red line to our new constrained dichromatic colour constancy. 

Abstract. The light reflected from a surface depends on the scene geometry, the 
incident illumination and the surface material. One of the properties of the material 
is its albedo ($\rho$) and its variation with respect to wavelength. The albedo of 
a surface is purely a physical property. Our perception of albedo is commonly 
referred to as colour. This paper presents a novel methodology for extracting the 
albedo of the various materials in the scene independent of incident light and 
scene geometry. A scene is captured under different narrow-band colour filters 
and the spectral derivatives of the scene are computed. The resulting spectral 
derivatives form a spectral gradient at each pixel. This spectral gradient is a nor- 
malized albedo descriptor which is invariant to scene geometry and incident illu- 
mination for diffuse surfaces. 



1 Introduction 

The starting point of most computer vision techniques is the light intensity reflected 
from an imaged scene. The reflected light is directly related to the geometry of the 
scene, the reflectance properties of the materials in the scene and the lighting condi- 
tions under which the scene was captured. One of the complications which have trou- 
bled computer vision algorithms is the variability of an object’s appearance as 
illumination and scene geometry change. Slight variations in viewing conditions often 
cause large changes in an object’s appearance. Consider, for example, a yellow car seen 
on a sunny day, at night, or in dense fog. 

Many areas of computer vision are affected by variations in an object’s appearance. 
Among the most well-known problems is colour constancy, the task of consistently 
identifying colours, despite changes in illumination conditions. Maloney and Wandell[18] 
were the first to develop a tractable colour constancy algorithm by modeling 
both the surface reflectance and the incident illumination as a finite dimensional linear 
model. This idea was further explored by Forsyth[6], Ho et al.[14], Finlayson et al.[5, 
7, 4, 1] and Healey and Slater[11]. Colour is a very important cue in object identification. 
Swain and Ballard[23] showed that objects can be recognized by using colour 
information alone. Combining colour cues with colour constancy[11, 21, 7, 4] generated 
even more powerful colour-guided object recognition systems. 

Extracting reflectance information is an under-constrained problem. All the afore- 
mentioned methodologies had to introduce some additional constraints that may limit 
their applicability. For example, most colour techniques assume that the spectral 
reflectance functions have the same degrees of freedom as the number of photoreceptor 
classes (typically three). Thus, none of these methods can be used in greyscale 
images for extracting illumination invariant colour information. Furthermore, a considerable 
body of work on colour assumes that the incident illumination has two or three 
degrees of freedom. However, Slater and Healey[22] showed that for outdoor scenes, 
the illumination functions have seven degrees of freedom. 

We propose a new technique for extracting colour information that is invariant to 
geometry and incident illumination. We examine the rate of change in reflected inten- 
sity with respect to wavelength over the visible part of the electromagnetic spectrum. 
For diffuse surfaces, independent of the particular model of reflectance, the only factor 
that contributes to variations over the wavelength is the albedo of the surface. Thus, 
what we end up extracting is the reflectivity profile of the surface. The only assumption 
that we make is that incident illumination remains stable over small intervals in the 
visible spectrum. It will be demonstrated that this is a reasonable assumption. 

We take a greyscale image of a scene under eight different colour filters and com- 
pute the spectral derivatives of the scene. Unlike many colour constancy methods, we 
employ multiple colour filters with narrow bandwidth (eight 10nm wide filters as 
opposed to the typical three 75nm filters). The use of narrow filters increases the dis- 
criminatory power of our method. The advantage of the traditional RGB systems is 
that they resemble the human visual sensor. Unfortunately, as Hall and Greenberg[10] 
have shown, the employment of only 3 bands of wavelengths introduces significant 
colour distortions. An additional advantage of our technique over the more traditional 
band-ratios is that spectral derivatives are used on a per pixel basis. They do not 
depend on neighbouring regions, an assumption that is common in other photometric 
methods, which use logarithms and/or narrow-band filters[7]. 

The collection of spectral derivatives evaluated at different wavelengths forms a 
spectral gradient. This gradient is a normalized albedo descriptor, invariant to scene 
geometry and incident illumination for smooth diffuse surfaces. Experiments on sur- 
faces of different colours and materials demonstrate the ability of spectral gradients to: 
a) identify surfaces with the same albedo under variable viewing conditions; b) dis- 
criminate between surfaces that have different albedo; and c) provide a measure of how 
close the colours of the two surfaces are. 



2 Spectral Derivative 

The intensity images that we process in computer vision are formed when light from a 
scene falls on a photosensitive sensor. The amount of light reflected from each point 
$p = (x, y, z)$ in the scene depends on the light illuminating the scene, $E$, and the surface 
reflectance $S$ of the surfaces in the scene: 

$$I(p, \lambda) = E(p, \lambda)\, S(p, \lambda) \qquad (1)$$ 

where $\lambda$, the wavelength, shows the dependence of incident and reflected light on 
wavelength. The reflectance function $S(p, \lambda)$ depends on the surface material, the 
scene geometry and the viewing and incidence angles. 

When the spectral distribution of the incident light does not vary with the direction of 
the light, the geometric and spectral components of the incident illumination are separable: 

$$E(\lambda, \theta_i, \phi_i) = e(\lambda)\, E(\theta_i, \phi_i) \qquad (2)$$ 

where $(\theta_i, \phi_i)$ are the spherical coordinates of the unit-length light-direction vector 
and $e(\lambda)$ is the illumination spectrum. Note that the incident light intensity is 
included in $E(\theta_i, \phi_i)$ and may vary as the position of the illumination source changes. 
The scene brightness then becomes: 

$$I(p, \lambda) = e(p, \lambda)\, E(p, \theta_i, \phi_i)\, S(p, \lambda) \qquad (3)$$ 

Before we perform any analysis we simplify the scene brightness equation by taking 
its logarithm. The logarithmic brightness equation reduces the product of the incident 
illumination $E(p, \lambda)$ and the surface reflectance $S(p, \lambda)$ into a sum: 

$$L(p, \lambda) = \ln e(p, \lambda) + \ln E(p, \theta_i, \phi_i) + \ln S(p, \lambda) \qquad (4)$$ 

A colour descriptor which is invariant to viewpoint, scene geometry and incident 
illumination is albedo. Albedo ($\rho$) is the ratio of electromagnetic energy reflected by 
a surface to the amount of electromagnetic energy incident upon the surface [20]. A 
profile of albedo values over the entire visible spectrum is a physically based descriptor 
of colour. Albedo is one of a multitude of factors that determine the surface reflectance 
function $S(p, \lambda)$. 

Extracting the absolute albedo values directly from the image brightness without 
any a-priori knowledge of the materials of the scene or the imaging conditions, is an 
under-constrained problem. However, an invariant physically based colour descriptor 
can be computed by taking samples of the reflected light at different wavelengths and 
measuring the change in scene brightness between wavelengths. One technique for 
measuring such changes is by calculating the spectral derivative, which is the partial 
derivative of the logarithmic image with respect to wavelength $\lambda$: 

$$L_\lambda(p, \lambda) = \frac{e_\lambda(p, \lambda)}{e(p, \lambda)} + \frac{S_\lambda(p, \lambda)}{S(p, \lambda)} \qquad (5)$$ 

where $e_\lambda(p, \lambda) = \partial e(p, \lambda)/\partial\lambda$ is the partial derivative of the spectrum of the incident 
light with respect to wavelength and $S_\lambda(p, \lambda) = \partial S(p, \lambda)/\partial\lambda$ is the partial derivative 
of the surface reflectance with respect to wavelength. Our work concentrates on the 
visible part of the electromagnetic spectrum, i.e. from 400nm to 700nm. Ho, Funt and 
Drew[14] have shown that for natural objects the surface spectral reflectance curves, 
i.e. the plots of $S(p, \lambda)$ versus $\lambda$, are usually reasonably smooth and continuous over 
the visible spectrum. 




362 E. Angelopoulou 



For diffuse objects, the spectral derivative is a measure of how albedo changes with 
respect to wavelength. As such, it is invariant to incident illumination, scene geometry, 
viewpoint, and material macrostructure. 
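
As an illustration, the spectral derivative at one pixel can be approximated by finite differences of the logarithmic intensities measured through neighbouring filters. The discretisation, the filter wavelengths and the pixel readings below are assumptions made for this sketch; the text above only defines the continuous derivative.

```python
import math


def spectral_derivatives(intensities, wavelengths_nm):
    """Forward differences of ln I between neighbouring bands (per nm)."""
    logs = [math.log(v) for v in intensities]
    return [
        (logs[k + 1] - logs[k]) / (wavelengths_nm[k + 1] - wavelengths_nm[k])
        for k in range(len(logs) - 1)
    ]


# Hypothetical readings of one pixel through four narrow-band filters.
bands = [450.0, 480.0, 510.0, 540.0]
pixel = [0.20, 0.26, 0.31, 0.30]
dimmer = [0.4 * v for v in pixel]        # same surface, less light reaching it

d_full = spectral_derivatives(pixel, bands)
d_dim = spectral_derivatives(dimmer, bands)
```

A wavelength-independent scale factor, such as a change in shading, becomes an additive constant in the logarithm and drops out of the differences, so `d_full` and `d_dim` agree up to rounding.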



3 Invariance to Incident Illumination 

Although the spectral distribution of the most commonly used indoor-scene illumination 
sources (i.e., tungsten and fluorescent light) is not constant, one can assume that $e$ 
changes slowly over small increments of $\lambda$. This means that its derivative with respect 
to wavelength is approximately zero: 

$$e_\lambda(p, \lambda) \approx 0 \qquad (6)$$ 

An exception to this assumption are the localized narrow spikes that are present in 
the spectrum of fluorescent light (see fig. 1). The partial derivative with respect to 
wavelength is undefined at these spikes. 







Fig. 1. The emittance spectrum of fluorescent light. 



Although different types of fluorescent light exhibit such spikes at different parts of 
the spectrum, these spikes always have a very narrow width. Thus, we can discard in our 
analysis the wavelengths around which these spikes occur. By discarding these outliers, 
the assumption of a slowly changing $e$ is valid over most of the visible range. This 
implies that one can safely assume that in general the partial derivative of the logarithmic 
image depends only on the surface reflectance: 

$$L_\lambda(p, \lambda) \approx \frac{S_\lambda(p, \lambda)}{S(p, \lambda)} \qquad (7)$$ 


4 Invariance to Geometry and Viewpoint 

4.1 Lambertian Model 

A very simple model that is often used by both the computer vision community and the 
graphics community is the Lambertian reflectance model. Lambert’s law describes the 
behaviour of a perfectly diffuse surface, where the reflected light is independent of 
viewpoint. For a homogeneous surface, the reflected light changes only when the angle 
of incidence $\theta_i(p)$ between the surface normal at point $p$ and the incident illumination 
changes: 

$$S(p, \lambda) = \cos\theta_i(p)\, \rho(p, \lambda) \qquad (8)$$ 

where $\rho(p, \lambda)$ is the albedo or diffuse reflection coefficient at point $p$. 

Since, by definition, Lambertian reflectance is independent of viewpoint, the spec- 
tral gradient is also independent of viewpoint. Furthermore, the scene geometry, 
including the angle of incidence, is independent of wavelength. Therefore, when we 
take the partial derivative with respect to wavelength, the geometry term vanishes: 



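$$\frac{S_\lambda(p, \lambda)}{S(p, \lambda)} = \frac{\rho_\lambda(p, \lambda)}{\rho(p, \lambda)} \qquad (9)$$ 

where $\rho_\lambda(p, \lambda) = \partial\rho(p, \lambda)/\partial\lambda$ is the partial derivative of the surface albedo with 
respect to wavelength. 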

One of the advantages of spectral derivatives is that since the dependence on the 
angle of incidence gets cancelled out, there is no need for assuming an infinitely dis- 
tant light source. The incident illumination can vary from one point to another, without 
affecting the resulting spectral derivative. 



4.2 Smooth Diffuse Reflectance Model 

In reality there are very few objects that exhibit perfectly Lambertian reflectance. For 
smooth diffuse objects, the reflected light varies with respect to viewpoint. Wolff[25] 
introduced a new smooth diffuse reflectance model that incorporates the dependence 
on viewpoint: 



$$S(p, \lambda) = \cos\theta_i(p)\, \rho(p, \lambda)\, \bigl(1 - F(\theta_i(p), n(p))\bigr) \times \left(1 - F\!\left(\sin^{-1}\!\left(\frac{\sin\theta_r(p)}{n(p)}\right), \frac{1}{n(p)}\right)\right) \qquad (10)$$ 

where $\theta_i(p)$ and $\theta_r(p)$ are the incidence and viewing angles respectively, $F()$ is the 
Fresnel reflection coefficient, and $n(p)$ is the index of refraction. The index of refraction 
does indeed depend on wavelength. However, in dielectrics the refractive index 
changes by a very small amount over the visible range[2, 3, 8]. Thus, for dielectrics $n$ 
is commonly treated as a material constant under visible light. 

By taking the logarithm of the surface reflectance function, we simplify the underlying 
model by turning multiplicative terms into additive terms: 

$$\ln S(p, \lambda) = \ln\cos\theta_i(p) + \ln\rho(p, \lambda) + \ln\bigl(1 - F(\theta_i(p), n(p))\bigr) + \ln\left(1 - F\!\left(\sin^{-1}\!\left(\frac{\sin\theta_r(p)}{n(p)}\right), \frac{1}{n(p)}\right)\right) \qquad (11)$$ 



The next step is to compute the partial derivative with respect to wavelength. Once 
again, all the terms except the albedo term vanish. The dependence on scene geometry, 
including the viewing and incidence angles, has been cancelled out. Again, since 
the spectral derivative is independent of incident illumination, the direction of incident 
light can vary from one point to the next without affecting the spectral derivative. The 
spectral derivative becomes: 

$$\frac{S_\lambda(p, \lambda)}{S(p, \lambda)} = \frac{\rho_\lambda(p, \lambda)}{\rho(p, \lambda)} \qquad (12)$$ 
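
Written out term by term, the step can be sketched as follows. The sketch relies on the assumption stated above that the refractive index $n$ is constant over the visible range, and it uses $\theta_i$, $\theta_r$ for the incidence and viewing angles and $\rho$ for the albedo; these symbols are a reconstruction of the notation.

```latex
\begin{align*}
\frac{\partial}{\partial\lambda}\ln S(p,\lambda)
  &= \underbrace{\frac{\partial}{\partial\lambda}\ln\cos\theta_i(p)}_{0}
   + \frac{\partial}{\partial\lambda}\ln\rho(p,\lambda)
   + \underbrace{\frac{\partial}{\partial\lambda}
       \ln\bigl(1-F(\theta_i(p),n(p))\bigr)}_{0} \\
  &\quad + \underbrace{\frac{\partial}{\partial\lambda}
       \ln\Bigl(1-F\bigl(\sin^{-1}(\sin\theta_r(p)/n(p)),\,1/n(p)\bigr)\Bigr)}_{0} \\
  &= \frac{\rho_\lambda(p,\lambda)}{\rho(p,\lambda)}
\end{align*}
```

Only the albedo term depends on wavelength, so it is the only one that survives the differentiation.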



4.3 Generalized Lambertian Model 

Of course, not all diffuse surfaces are smooth. Oren and Nayar[19] developed a gener- 
alized Lambertian model which describes the diffuse reflectance of surfaces with sub- 
stantial macroscopic surface roughness. The macrostructure of the surface is modelled 
as a collection of long V-cavities. (Long in the sense that the area of each facet of the 
cavity is much larger than the wavelength of the incident light.) The modelling of a 
surface with V-cavities is a widely accepted surface description[24, 13]. 

The light measured at a single pixel of an optical sensor is an aggregate measure of 
the brightness reflected from a single surface patch composed of numerous V-cavities. 
Each cavity is composed of two planar Lambertian facets with opposing normals. All 
the V-cavities within the same surface patch have the same albedo, $\rho$. Different facets 
can have different slopes and orientation. Oren and Nayar assume that the V-cavities 
are uniformly distributed in azimuth angle orientation on the surface plane, while the 
facet tilt follows a Gaussian distribution with zero mean and standard deviation $\sigma$. The 
standard deviation $\sigma$ can be viewed as a roughness parameter. When $\sigma = 0$, all the 
facet normals align with the mean surface normal and produce a planar patch that 
exhibits an approximately Lambertian reflectance. As $\sigma$ increases, the V-cavities get 
deeper and the deviation from Lambert’s law increases. Ignoring interreflections from 
the neighbouring facets, but accounting for the masking and shadowing effects that the 
facets introduce, the Oren-Nayar model approximates the surface reflectance as: 



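$$S(p, \lambda, \sigma) = \frac{\rho(p, \lambda)}{\pi} \cos\theta_i(p) \Bigl[ C_1(\sigma) + \cos(\phi_r(p) - \phi_i(p))\, C_2(\alpha; \beta; \phi_r - \phi_i; \sigma; p) \tan\beta(p) + \bigl(1 - |\cos(\phi_r(p) - \phi_i(p))|\bigr)\, C_3(\alpha; \beta; \sigma; p) \tan\!\left(\frac{\alpha(p) + \beta(p)}{2}\right) \Bigr] \qquad (13)$$ 

where $(\theta_i(p), \phi_i(p))$ and $(\theta_r(p), \phi_r(p))$ are the spherical coordinates of the angles 
of incidence and reflectance respectively, $\alpha(p) = \max(\theta_i(p), \theta_r(p))$ and 
$\beta(p) = \min(\theta_i(p), \theta_r(p))$. $C_1()$, $C_2()$ and $C_3()$ are coefficients related to the surface 
macrostructure. The first coefficient, $C_1()$, depends solely on the distribution of the 
facet orientation, while the other two depend on the surface roughness, the angle of 
incidence and the angle of reflectance: 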



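$$C_1(\sigma) = 1 - 0.5\, \frac{\sigma^2}{\sigma^2 + 0.33} \qquad (14)$$ 

$$C_2(\alpha; \beta; \phi_r - \phi_i; \sigma; p) = \begin{cases} 0.45\, \dfrac{\sigma^2}{\sigma^2 + 0.09} \sin\alpha(p) & \text{if } \cos(\phi_r(p) - \phi_i(p)) \ge 0 \\[2ex] 0.45\, \dfrac{\sigma^2}{\sigma^2 + 0.09} \left( \sin\alpha(p) - \left( \dfrac{2\beta(p)}{\pi} \right)^3 \right) & \text{otherwise} \end{cases} \qquad (15)$$ 

$$C_3(\alpha; \beta; \sigma; p) = 0.125\, \frac{\sigma^2}{\sigma^2 + 0.09} \left( \frac{4\,\alpha(p)\,\beta(p)}{\pi^2} \right)^2 \qquad (16)$$ 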



For clarity of presentation, we set: 

$$V(p, \sigma) = C_1(\sigma) + \cos(\phi_r(p) - \phi_i(p))\, C_2(\alpha; \beta; \phi_r - \phi_i; \sigma; p) \tan\beta(p) + \bigl(1 - |\cos(\phi_r(p) - \phi_i(p))|\bigr)\, C_3(\alpha; \beta; \sigma; p) \tan\!\left(\frac{\alpha(p) + \beta(p)}{2}\right) \qquad (17)$$ 

The term $V(p, \sigma)$ accounts for all the reflectance effects which are introduced by the 
roughness of the surface. The angles of incidence and reflectance, as well as the distribution 
of the cavities, affect the value of the function $V(p, \sigma)$. The Oren-Nayar 
reflectance model can then be written more compactly as: 

$$S(p, \lambda, \sigma) = \frac{\rho(p, \lambda)}{\pi} \cos\theta_i(p)\, V(p, \sigma) \qquad (18)$$ 



Once again, when we take the logarithm of the surface reflectance function, we simplify 
the underlying model by turning the product into a sum: 

$$\ln S(p, \lambda, \sigma) = \ln\rho(p, \lambda) - \ln\pi + \ln\cos\theta_i(p) + \ln V(p, \sigma) \qquad (19)$$ 



When we compute the partial derivative of this logarithm with respect to wavelength,
the only term that remains is the albedo. The angle of incidence $\theta_i(p)$ and the
constant $\pi$ are independent of wavelength. None of the terms in the function $V(p,\sigma)$,
see equation (17), vary with the wavelength. More specifically, $C_1()$ is only a function
of $\sigma$, which is a surface property independent of wavelength. $C_2()$ and $C_3()$ are
functions of $\sigma$, the viewing and incidence angles, and of $\alpha$ and $\beta$, which in turn are also
functions of the viewing and incidence directions. None of these factors is affected by
wavelength. Thus, even for rough diffuse surfaces, the spectral derivative is:



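\[
L_\lambda(p,\lambda) \approx \frac{S_\lambda(p,\lambda,\sigma)}{S(p,\lambda,\sigma)} = \frac{\rho_\lambda(p,\lambda)}{\rho(p,\lambda)}
\tag{20}
\]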
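Spelled out term by term, and assuming as stated above that only the albedo $\rho$ depends on wavelength, the derivative of the logarithm of the Oren-Nayar reflectance is (a worked restatement, with the subscript $\lambda$ denoting the partial derivative with respect to wavelength):

```latex
\begin{align*}
\frac{\partial}{\partial\lambda}\ln S(p,\lambda,\sigma)
  &= \frac{\partial}{\partial\lambda}\ln\rho(p,\lambda)
   - \frac{\partial}{\partial\lambda}\ln\pi
   + \frac{\partial}{\partial\lambda}\ln\cos\theta_i(p)
   + \frac{\partial}{\partial\lambda}\ln V(p,\sigma) \\
  &= \frac{\rho_\lambda(p,\lambda)}{\rho(p,\lambda)} - 0 + 0 + 0 .
\end{align*}
```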



5 Spectral Gradient 

For diffuse surfaces, independent of the particulars of the reflectance behaviour, the
partial derivative with respect to wavelength of the logarithmic image $L_\lambda(p,\lambda)$ is a
function of only the surface albedo. More specifically, the spectral derivative
approximates the normalized partial derivative of the albedo with respect to wavelength,
$\rho_\lambda(p,\lambda)/\rho(p,\lambda)$. By normalized we mean that the derivative is divided by the
magnitude of the albedo itself.

Consider now a collection of spectral derivatives of a logarithmic image at various
spectral locations $\lambda_k$, $k = 1, 2, 3, \ldots, M$. The resulting spectral gradient is an
M-dimensional vector $(L_{\lambda_1}, L_{\lambda_2}, \ldots, L_{\lambda_M})$ which is invariant to illumination, surface
geometry and viewpoint. All it encodes is information at discrete spectral locations
about how fast the surface albedo changes as the spectrum changes. It is a profile of the
rate of change of albedo with respect to wavelength over a range of wavelengths. Thus,
the spectral gradient is a colour descriptor that depends purely on the surface albedo, a
physical material property. Although our perception of colour depends on the surface
albedo, a colour descriptor such as the spectral gradient, which is purely albedo-based,
does not convey any perceptual meaning of colour. It remains unaltered by changes in
the environment (viewing position, illumination, geometry).

6 Experiments

6.1 Experimental Setup 

In order to compute the spectral derivatives we took images of each scene under eight
different narrow bandpass filters: a Corion S10-450-F, a Corion S10-480-F, a Corion
S10-510-F, a Corion S10-540-F, a Corion S10-570-F, a Corion S10-600-F, a Corion
S10-630-F and a Corion S10-660-F. Each of these filters has a bandwidth of approximately
10nm and a transmittance of about 50%. The central wavelengths are at 450nm,
480nm, 510nm, 540nm, 570nm, 600nm, 630nm and 660nm respectively. If one were
to assign colour names to these filters, one could label them as follows:
450nm=blue, 480nm=cyan, 510nm=green, 540nm=yellow, 570nm=amber, 600nm=red,
630nm=scarlet red, 660nm=mauve.

The use of narrow bandpass filters allowed us to closely sample almost the entire
visible spectrum. The dense narrow sampling permitted us to avoid sampling (or
ignore samples) where the incident light may be discontinuous (see section 3). Hall and
Greenberg [10] have demonstrated that such a sampling density provides for the
reproduction of a good approximation of the continuous reflectance spectrum. The images
were captured with a Sony XC-77 camera using a 25mm lens (fig. 2).




Fig. 2. A picture of the experimental setup showing the analog greyscale camera and the filter 
wheel mounted in front of it. 



The only source of illumination was a single tungsten light bulb mounted in a
reflected scoop. For each scene we used four different illumination setups, generated
by the combination of two distinct light bulbs, a 150W bulb and a 200W bulb, and two
different light positions. One illumination position was to the left of the camera and
about 5cm below the camera. Its direction vector formed approximately a 30° angle
with the optic axis. The other light-bulb position was to the right of the camera and
about 15cm above it. Its direction vector formed roughly a 45° angle with the optic
axis. Both locations were 40cm away from the scene.

The imaged objects were positioned roughly 60cm from the camera/filter setup. We
tried four different types of materials: foam, paper, ceramic and a curved metallic surface
painted with flat (matte) paint. The foam and the paper sheets came in a variety of
colours. The foam, which was a relatively smooth and diffuse surface, came in white,
pink, magenta, green, yellow, orange and red samples (see fig. 3 top left). The paper
had a rougher texture and came in pink, fuchsia, brown, orange, yellow, green, white,
blue, and violet colours (see fig. 3 top right). We also took images of a pink ceramic
plate. Its surface was extremely smooth and exhibited localized specularities and
self-shadowing at the rims (see fig. 3 bottom left). Finally we took images of single albedo
curved surfaces (a mug and a painted soda-can) to test the invariance to the viewing
and incidence angles (see fig. 3 bottom middle and right respectively).




Fig. 3. Pictures of the objects and materials used in the experiments. Top left: various colours of 
foam; top right: various colours of paper; bottom left: a pink ceramic plate; bottom center: a 
white ceramic mug; bottom right: a white spray-painted soda can. 



Fig. 4 shows samples of the actual images that we captured. All the images in this
figure were taken using the Corion S10-600-F filter. The light illuminating the scene is a
200W tungsten bulb positioned at the lower left corner (as can be seen from the
highlights). On the top left are the coloured samples of foam, while on the right are the
multiple coloured samples of paper. By comparing the stripes in fig. 4 with those in fig.
3, one can tell which colours reflect around 600nm (pink, orange, yellow and white in
the case of the paper).

Fig. 4. Sample filtered images of the objects shown in fig. 3. All the images in this figure were
taken with the red Corion S10-600-F filter.



6.2 Computing the Spectral Gradient 

Once a filtered image was captured, its logarithmic image was generated. In a logarith- 
mic image the value stored at each pixel was the natural logarithm of the original 
image intensity. For example: 



\[
L_w = \ln(I_w) \quad \text{where } w = 450, 480, 510, 540, 570, 600, 630, 660
\tag{21}
\]



where $I_w$ was the image of a scene taken with the S10-w-F filter and $L_w$ was its
logarithmic image. The last step involved the computation of the spectral derivatives of the
logarithmic images. Differentiation was approximated via finite-differencing. Thus,
each $L_{\lambda_k}$ was computed over the wavelength interval $\delta\lambda = 30$nm by subtracting two
logarithmic images taken under two different colour filters which were 30nm apart:



\[
L_{\lambda_k} = L_{w+30} - L_w
\tag{22}
\]



where k = 1, 2, ..., 7 and w = 450, 480, 510, 540, 570, 600, 630 accordingly. In
our setup the spectral gradient was a 7-vector:

\[
(L_{\lambda_1}, L_{\lambda_2}, \ldots, L_{\lambda_7})
\tag{23}
\]

This vector was expected to remain constant for diffuse surfaces with the same
albedo profile, independent of variations in viewing conditions. At the same time, the
spectral gradient should differ between distinct colours. Furthermore, the more distant
the colours are, the bigger the difference between the respective spectral gradients
should be.
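The finite-difference computation described above can be sketched for a single pixel as follows. This is an illustrative sketch only: the pixel values are made up, and the name `spectral_gradient` is not from the paper. It also shows why a wavelength-independent scale factor (such as the geometric term of the reflectance model) drops out of the result.

```python
import math

# Central wavelengths of the eight filters, in nm.
WAVELENGTHS_NM = [450, 480, 510, 540, 570, 600, 630, 660]


def spectral_gradient(intensities):
    """Return the 7-vector of log-intensity differences for one pixel.

    `intensities` holds the eight positive filtered intensities, ordered as
    in WAVELENGTHS_NM. Each entry of the result is ln(I_{w+30}) - ln(I_w).
    """
    if len(intensities) != len(WAVELENGTHS_NM):
        raise ValueError("expected one intensity per filter")
    logs = [math.log(v) for v in intensities]
    return [logs[k + 1] - logs[k] for k in range(len(logs) - 1)]


# Hypothetical pixel values, for illustration only.
pixel = [12.0, 15.0, 30.0, 42.0, 40.0, 35.0, 20.0, 18.0]
# The same surface seen with a different wavelength-independent factor:
# every band is multiplied equally, so the factor cancels in the differences.
dimmer = [0.37 * v for v in pixel]

gradient = spectral_gradient(pixel)
gradient_dimmer = spectral_gradient(dimmer)
```

Up to floating-point rounding, `gradient` and `gradient_dimmer` are the same 7-vector.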

The following figures show the plots of the spectral gradient values for each colour
versus the wavelength. The horizontal axis is the wavelength, ranging from the
$\lambda_1 = 480 - 450$ interval to the $\lambda_7 = 660 - 630$ interval. The vertical axis is the
normalized partial derivative of albedo over the corresponding wavelength interval. Fig. 5
shows the plots of different colours of paper (fig. 5(a)) and of different colours of foam
(fig. 5(b)). Within each group, the plots are distinct and easily distinguishable from
each other.



[Figure: two panels, (a) and (b), each titled "Spectral Gradient Plots", with wavelength on the horizontal axis.]



Fig. 5. Spectral gradients of different colours of (a) paper and (b) foam under the same viewing 
conditions (same illumination, same geometry). 



On the other hand, the spectral gradients of different surfaces of the same colour
generate plots that look almost identical. Fig. 6 shows the gradient plots for the white
paper, the white foam, the white mug, and the white painted soda can.



[Figure: plot titled "Spectral Gradient Plots", with wavelength on the horizontal axis.]



Fig. 6. Spectral gradients of different white surfaces (foam, paper, ceramic mug, diffuse can) 
under the same viewing conditions (same illumination, same geometry). 

In a similar manner, when we have similar but not identical colours, the spectral
gradient plots resemble each other, but are not as closely clustered. Fig. 7 shows the
spectral gradients of various shades of pink and magenta. The closer the two shades
are, the more closely the corresponding plots are clustered.



[Figure: plot titled "Spectral Gradient Plots", with wavelength on the horizontal axis.]



Fig. 7. Spectral gradients of different shades of pink and different materials: pink foam, magenta
foam, pink paper, fuchsia paper and pink ceramic plate. All images were taken under the same
viewing conditions (same illumination, same geometry).



The next couple of figures demonstrate that the spectral gradient remains constant 
under variations in illumination and viewing. This is expected as spectral gradients are 
purely a function of albedo. The plots in fig. 8 were produced by measuring the spec- 
tral gradient for the same surface patch while altering the position and intensity of the 
light sources. 

[Figure: two panels, (a) and (b).]



Fig. 8. Spectral gradients of the same colour, (a) green and (b) pink, under varying illumination.
Both the position and the intensity of illumination are altered, while the viewing position remains
the same.



In order to demonstrate the invariance to the viewing angle and the surface geometry,
we show the plots of the spectral gradients produced by different patches of the
same curved object (the painted soda can in this case). At least one patch has its
tangent roughly parallel to the image plane, while the other patches are at mildly to quite
oblique angles to the viewer and/or the incident light. As fig. 9 shows, the spectral
gradient plots still remain closely clustered.



[Figure: plot titled "Spectral Gradient Plots".]

Fig. 9. Spectral gradients of white colour at different angles of incidence and reflectance. The
spectral gradients at different surface patches of the white soda can are shown. The surface
normals for these patches vary from almost parallel to the optic axis, to very oblique.



This last series of tests also revealed a limitation of our technique. For very oblique
angles of incidence, where $\theta_i > 60^\circ$, the spectral gradients do not remain constant.
Deviations at large angles of incidence are a known physical phenomenon [17]. Oren
and Nayar [19] also point out that in this special case, most of the light that is reflected
from a surface patch is due to interreflections from nearby facets. In fig. 10 we show
the spectral gradient plots obtained from patches of the painted can which are at large
angles of incidence. (The light is almost grazing the surface.) For comparison purposes
we also included two plots produced from patches with smaller angles of incidence.

Fig. 10. Spectral gradients of white colour at large angles of incidence. The spectral gradients at
different surface patches of the white soda can are shown. For all these patches the incident light
is almost grazing the surface.

As a last note, we would like to point out that although some of our test objects
exhibit specular highlights, we did not analyze any of the data from the specular areas.
Our methodology assumes diffuse reflectance and we limited our tests to the diffuse
parts of the images. There is currently ongoing investigation of how specularities and
interreflections affect the spectral gradient.



7 Conclusions and future work 

We developed a technique for extracting surface albedo information which is invariant 
to changes in illumination and scene geometry. The spectral information we extract is 
purely a physical property. Hence the term objective colour. We made no assumptions 
about the nature of incident light, other than that its spectrum does not change with its 
position. We showed that spectral gradients can be used on a pixel basis and do not 
depend on neighbouring regions. The effectiveness of spectral gradients as a colour 
descriptor was demonstrated on various empirical data. 

The invariant properties of spectral gradients, together with their ease of implementation
and the minimalism of assumptions, make this methodology a particularly
appealing tool in many diverse areas of computer vision. They can be used in material
classification, grey-scale colour constancy, or in tracking different regions under
variable illumination.

Spectral gradients and the underlying dense spectral sampling provide a rich 
description of surface reflectance behaviour. We are already studying the phenomenon 
of interreflections at large angles of incidence. We are also examining the behaviour of 
highlights under this fine multispectral sampling. Another topic that we are working on 
is the simultaneous employment of multiple illumination sources. Finally a topic that 
we are also planning to address is the effect of changes in the spectrum of the incident 
illumination. 

Abstract. We improve the promising Colour by Correlation method for
computational colour constancy by modifying it to work in a three-dimensional
colour space. The previous version of the algorithm uses only
the chromaticity of the input, and thus cannot make use of the information
inherent in the pixel brightness which previous work suggests is useful.

We develop the algorithm for the Mondrian world (matte surfaces), the 
Mondrian world with fluorescent surfaces, and the Mondrian world with 
specularities. We test the new algorithm on synthetic data, and on a data set 
of 321 carefully calibrated images. We find that on the synthetic data, the 
new algorithm significantly out-performs all other colour constancy 
algorithms. In the case of image data, the results are also promising. The 
new algorithm does significantly better than its chromaticity counter-part, 
and its performance approaches that of the best algorithms. Since the 
research into the method is still young, we are hopeful that the performance 
gap between the real and synthetic case can be narrowed. 

1 Introduction 

The image recorded by a camera depends on three factors: the physical content of the
scene, the illumination incident on the scene, and the characteristics of the camera. It 
is the goal of computational colour constancy to identify, separate, or mitigate the 
effects of these factors. Doing so has applications in computer vision and image 
reproduction. 

In this paper we improve the promising Colour by Correlation [1] method 
for computational colour constancy by casting it in a three-dimensional colour space. 
Colour by Correlation is promising because it can combine more sources of 
information than the related state-of-the-art gamut-mapping approach [2-6], and thus it 
is potentially even more effective. The extra source of information that becomes 
available is the statistical distribution of expected surfaces and illuminants, and how 
their interactions affect the expected distribution of the observed camera responses. 
However, the current version of Colour by Correlation uses only chromaticity 
information. Since the algorithm works in a chromaticity space, it has no way of
using any information that might be available in the pixel brightness. However, 
previous work has shown that it is beneficial to use the pixel brightness information 
[4, 6, 7], even if only the illuminant chromaticity is being sought. Thus it is natural 
to modify Colour by Correlation so that it can also use this information. 

In this paper we provide details of the changes required to have this algorithm 
work in a three-dimensional colour space. The modified algorithm naturally allows 
extensions for both metallic and non-metallic specularities, in analogy with previous 
work on gamut-mapping for these conditions [7, 8]. In addition, the algorithm can 
deal with fluorescent surfaces, much like the two-dimensional version, as described in 
[7, 9]. We begin with a brief discussion of the two-dimensional Colour by 
Correlation algorithm. 

2 Colour by Correlation in Chromaticity Space 

Finlayson et al. introduced Colour by Correlation [1] as an improvement on the
Colour in Perspective method [3]. The basic idea of Colour by Correlation is to pre- 
compute a correlation matrix which describes how compatible proposed illuminants 
are with the occurrence of image chromaticities. Each row in the matrix corresponds 
to a different training illuminant. The matrix columns correspond to possible 
chromaticity ranges resulting from a discretization of (r,g) space, ordered in any 
convenient manner. Two versions of Colour by Correlation are described in [1]. In the 
first version, the elements of the correlation matrix corresponding to a given 
illuminant are computed as follows: First, the (r,g) chromaticities of the reflectances 
in the training set under that illuminant are computed using the camera sensors. Then 
the convex hull of these chromaticities is found, and all chromaticity bins within the 
hull are identified as being compatible with the given illuminant. Finally, all entries 
in the row for the given illuminant corresponding to compatible chromaticities are set 
to one, and all other elements in that row are set to zero. 

To estimate the illuminant chromaticity, the correlation matrix is multiplied 
by a vector whose elements correspond to the ordering of (r,g) used in the correlation 
matrix. The elements of this vector are set to one if the corresponding chromaticity 
occurred in the image, and zero otherwise. The i'th element of the resulting vector is 
then the number of chromaticities which are consistent with the illuminant. Under 
ideal circumstances, all chromaticities in the image will be consistent with the actual 
illuminant, and that illuminant will therefore have maximal correlation. As is the case 
with gamut-mapping methods, it is possible to have more than one plausible 
illuminant, and in our implementation we use the average of all candidates close to 
the maximum. We label this algorithm C-by-C-01 in the results. 
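The matrix-vector product of this first version can be sketched on a toy example. The matrix below is hypothetical (3 illuminants, 5 chromaticity bins), and the `tolerance` parameter is an assumed way of expressing "close to the maximum"; the text does not specify how closeness is measured.

```python
def correlate(correlation_matrix, image_bins):
    """Multiply the 0/1 correlation matrix by the 0/1 image occurrence vector.

    correlation_matrix[i][j] is 1 if chromaticity bin j is compatible with
    training illuminant i. image_bins is the set of bin indices that occur
    in the image. Returns one score per illuminant: the number of image
    chromaticities consistent with it.
    """
    return [sum(row[j] for j in image_bins) for row in correlation_matrix]


def candidates_near_max(scores, tolerance=0):
    """Indices of all illuminants whose score is within `tolerance` of the best."""
    best = max(scores)
    return [i for i, s in enumerate(scores) if best - s <= tolerance]


# Toy example: 3 hypothetical illuminants, 5 chromaticity bins.
matrix = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
]
scores = correlate(matrix, {1, 2, 3})    # [2, 3, 2]
plausible = candidates_near_max(scores)  # [1]
```

The final estimate would then be the average chromaticity of the illuminants listed in `plausible`.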

In the second version of Colour by Correlation, the correlation matrix is set 
up to compute the probability that the observed chromaticities are due to each of the 
training illuminants. The best illuminant can then be chosen using a maximum 
likelihood estimate, or using some other estimate as discussed below. To compute the
correlation matrix, we again find the set of (r,g) for each illuminant using our database
of surface reflectances. The frequency of occurrence of each discrete (r,g) is then
recorded. If additional information about the frequency of occurrence of these
reflectances is available, then the frequency counts are weighted accordingly. However, 
since such a distribution is not readily available for the real world, in our 
implementation we simply use uniform statistics. The same applies for the 
illuminant data set. The counts are proportional to the probability that a given (r,g) 
would be observed, given the specific illuminant. The logarithms of these 
probabilities for a given illuminant are stored in a corresponding row of the 
correlation matrix. The application of the correlation matrix, which is done exactly as 
described above, now computes the logarithm of the posterior distribution. 

This computation of the posterior distribution is a simple application of 
Bayes's rule. Specifically, the probability that the scene illuminant is I, given a 
collection of observed chromaticities C, is given by: 



\[
P(I|C) = \frac{P(C|I)\,P(I)}{P(C)}
\tag{1}
\]



Since we are assuming uniform priors for I, and since P(C) is a normalization which 
is not of interest, this reduces to: 

\[
P(I|C) \propto P(C|I)
\tag{2}
\]

Assuming that the observed chromaticities are independent, P(C|I) itself is the product
of the probabilities of observing the individual chromaticities c, given the
illuminant I:



\[
P(C|I) = \prod_{c \in C} P(c|I)
\tag{3}
\]

Taking logarithms gives: 

\[
\log(P(C|I)) = \sum_{c \in C} \log(P(c|I))
\tag{4}
\]

This final quantity is exactly what is computed by the application of the correlation 
matrix to the vector of chromaticity occurrences. Specifically, the i'th element of the 
resulting vector is the logarithm of the posterior probability for the i'th illuminant. 
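A small sketch of this second version follows, assuming uniform priors as in the text. The counts are invented for illustration, and the helper names `build_log_rows` and `log_posterior` are hypothetical.

```python
import math


def build_log_rows(counts_per_illuminant):
    """Turn per-illuminant bin counts into rows of log probabilities.

    Bins with a zero count get -inf, i.e. zero probability.
    """
    rows = []
    for counts in counts_per_illuminant:
        total = sum(counts)
        rows.append([math.log(c / total) if c > 0 else -math.inf
                     for c in counts])
    return rows


def log_posterior(log_rows, image_bins):
    """Equation (4): sum log P(c|I) over the bins observed in the image."""
    return [sum(row[j] for j in image_bins) for row in log_rows]


# Toy counts for 2 hypothetical illuminants over 4 chromaticity bins.
counts = [[6, 3, 1, 0],
          [1, 3, 3, 3]]
log_rows = build_log_rows(counts)
scores = log_posterior(log_rows, [1, 2])
# Index of the illuminant with the largest log posterior.
map_index = max(range(len(scores)), key=scores.__getitem__)
```

Here the second illuminant wins, because bin 2 is three times as likely under it as under the first. An image containing bin 3 would give the first illuminant a score of minus infinity, which is the zero-probability problem discussed next.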

The method described so far will work fine on synthetic data, provided that 
the test illuminant is among the training illuminants. However, once we apply the 
method to the real world, there are several potential problems. First, due to noise, and 
other sources of mismatches between the model and the real world, an observed set of 
chromaticities can yield zero probability for all illuminants, even if the illuminant, or 
a similar one, is in the training set. Second, the illumination may be a combination 
of two illuminants, such as an arbitrary mix of direct sunlight and blue sky, and 
ideally we would like the method to give an intermediate answer. We deal with these 
problems as follows. First, as described below, we ensure that our illuminant set 
covers (r,g) space, so that there is always a possible illuminant not too far from the 
actual. Second, as we build the correlation matrices, we smooth the frequency 
distribution of observed (r,g) with a Gaussian filter. This ensures that there are no 
holes in the distribution, and compensates for noise. 

The final step is to choose an answer, given the posterior probability
distribution, an example of which is shown in Figure 1. The original work [1]
mentions three choices: the maximum likelihood, mean likelihood, or local area
mean, introduced in [10]. That work discusses these methods in detail with respect to
a related colour constancy algorithm, where they are referred to as the MAP, MMSE,
and MLM estimators, respectively. We will adopt this notation here. The MAP
estimate is simply the illuminant which has the maximum posterior probability. To
compute the MMSE estimate of the chromaticity we take the average (r,g)
weighted by the posterior distribution. The MLM estimator is computed by
convolving the posterior distribution with a Gaussian mask, and then finding the
maximum. In general, one would like to choose the particular Gaussian mask which
minimizes the error of some specific task. Unfortunately, the bulk of our results are
not of much help here, as they are based on RMS error, and thus we already know that
the MMSE method will work better. Thus we provide results only for the
computationally cheaper MAP method, and the least-squares optimal MMSE method.
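The two estimators reported in the results can be sketched as follows. The illuminant chromaticities and posterior values are made up for illustration, and the function names are not from the paper.

```python
import math


def map_estimate(log_post, chromaticities):
    """Chromaticity of the illuminant with maximum posterior probability."""
    best = max(range(len(log_post)), key=log_post.__getitem__)
    return chromaticities[best]


def mmse_estimate(log_post, chromaticities):
    """Posterior-weighted mean (r, g) over the training illuminants."""
    peak = max(log_post)
    # Subtracting the peak before exponentiating avoids underflow.
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    r = sum(w * c[0] for w, c in zip(weights, chromaticities)) / total
    g = sum(w * c[1] for w, c in zip(weights, chromaticities)) / total
    return (r, g)


# Hypothetical illuminant chromaticities and log posteriors.
illum_rg = [(0.30, 0.30), (0.40, 0.35), (0.50, 0.40)]
log_post = [math.log(0.2), math.log(0.5), math.log(0.3)]

map_rg = map_estimate(log_post, illum_rg)    # (0.40, 0.35)
mmse_rg = mmse_estimate(log_post, illum_rg)  # approximately (0.41, 0.355)
```

The MMSE answer need not coincide with any training illuminant, which is what allows intermediate answers for mixed illumination.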



[Figure: plot titled "Example Colour by Correlation Posterior Distribution", with axis label "Normalized Probability".]




Fig. 1: An example posterior distribution, showing the probabilities that the illuminant
in the training set with chromaticity (r,g) explains the observed data produced from a
randomly selected illuminant and 8 randomly selected surface reflectances.

3 The Extension to Three Dimensions

We now consider what a three-dimensional analog to the two-dimensional algorithm 
entails. In the two-dimensional case, the observed two-dimensional descriptors 
(chromaticities) were tested against possible theories of the distributions of those 
descriptors, each theory corresponding to one of the illuminants in the training set. In 
the three-dimensional version, we wish to do the same with three-dimensional 
descriptors. However, we run into the problem that the brightness of the illuminant 
changes the observed values. In effect, not only does each illuminant produce a theory, 
but every brightness level of each illuminant produces a theory. Thus we must 
attempt to match over possible illuminant brightnesses, as well as over illuminants. 
This leads to several problems. 

The first problem is that, a priori, the illuminant can have any non-negative
brightness. This is different from chromaticity, which is naturally constrained, and thus
easily discretized. To solve this problem we propose making an initial estimate of the
illuminant brightness using some other means. For this, we found a grey world type
estimate to be adequate. Specifically, we compute the average of R+G+B over the
image pixels, and multiply the result by a factor chosen to give the best estimate
when the same procedure was applied to synthetic data. The value of the factor used
for the experiments was 4.3. For some applications the median or other estimation
method may very well be superior to the average, being more robust in the face of
outliers. However, on our data, the mean worked better than the median.

Having determined an estimate of the illuminant brightness, we reason that it 
is unlikely to be wrong by more than a factor of k=3. Now, on the assumption that 
the illuminant brightness is between L/k and kL, we discretize this range on a 
logarithmic scale, giving us a finite number of possible illuminant brightness 
theories. We verified that the specific choice of k=3 gives the same results as 
providing the algorithm with the exact illuminant brightness in its place. Clearly, a 
larger or smaller value could be more appropriate, depending on circumstances. 
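The brightness estimate and its discretization might look like the sketch below. The factor 4.3 and k=3 come from the text; the number of levels (`steps`) is not given here, so the default of 9 is purely an assumption, as are the function names.

```python
import math


def grey_world_brightness(pixels, factor=4.3):
    """Initial illuminant brightness: mean of R+G+B times a tuned factor."""
    mean_sum = sum(r + g + b for r, g, b in pixels) / len(pixels)
    return factor * mean_sum


def brightness_levels(estimate, k=3.0, steps=9):
    """Candidate brightnesses from estimate/k to estimate*k.

    The levels are evenly spaced on a logarithmic scale, so consecutive
    levels differ by a constant ratio.
    """
    if estimate <= 0 or k <= 1 or steps < 2:
        raise ValueError("need estimate > 0, k > 1 and steps >= 2")
    lo, hi = math.log(estimate / k), math.log(estimate * k)
    return [math.exp(lo + (hi - lo) * i / (steps - 1)) for i in range(steps)]


levels = brightness_levels(90.0, k=3.0, steps=5)
```

With an estimate of 90 and five steps, the levels run from 30 to 270 with 90 in the middle.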

The next problem that we faced is that the literal analogy of the two-dimensional
method leads to unmanageably large correlation matrices. There are two
contributions to the increase in size. First, the matrix row length increases because of
the added descriptor: the rows now store linearized versions of three-dimensional
arrays where two-dimensional arrays were previously stored. Second, the strategy of
considering each illuminant at each brightness level implies, a priori, that we would
further need to increase the number of rows by a factor of the brightness resolution
because now we would need a row for every brightness of every illuminant. The
combined effect of these two factors leads to correlation matrices which are simply too
large.

Fortunately, the second increase in size is not necessary. We instead loop 
over the possible brightnesses, and simply scale the input by an appropriate amount 
each time. Conceptually, this amounts to the same thing as having a correlation 
matrix row for each illuminant at each brightness. In practice, however, it leads to a 
subtle problem due to the discretization. If we consider the alternative of building a 
correlation matrix row for each possible brightness, we see that as the proposed 
illuminant brightness decreases, the bins become proportionally more populated. For
example, if the illuminant brightness is halved, then the same data is put into half as 
many bins. Now the algorithm proceeds by summing terms for each observed 
response. The terms are the logarithms of quantities proportional to the probability 
that a proposed illuminant occurs with the observed response. If we consider each term 
to be negative, then we see that decreasing the number of terms increases the sum. 
Since we are trying to maximize this sum, the algorithm will favor low brightness 
values, because these tend to put the observations into as few bins as possible, 
leading to fewer terms. This is an artifact of the discretization, and clearly is not 
wanted. 

We discuss two possible approaches to deal with this problem. First, we can 
allow duplicate entries into the bins. In order to minimize the effect of duplicates 
present in the data, the data could be pre-processed to remove all initial duplicates. 
This method gives excellent results when used in conjunction with generated data. 
However, we have not had equal success with image data. 

A second approach to the above problem is to compensate for the 
discretization problem directly. We reason as follows: If we were to have constructed 
correlation matrices for each brightness level, then the frequency counts placed in the 
bins to compute the probabilities would have been roughly inversely proportional to 
the brightness. Thus the probabilities themselves would be inversely proportional to 
the brightness, and to obtain a fair estimate, we need to divide each probability in the 
product by a value proportional to the brightness. In the log representation, this 
means that we subtract the log of the brightness times the number of occupied bins. 
This method also yields excellent results when used in conjunction with generated 
data. More importantly, the results using image data are also promising. We feel that 
this algorithm can be substantially improved, and one of the key areas for further 
study is this discretization problem. 
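The second correction can be written down literally as a one-line adjustment to the log score. The sketch below is only a transcription of the sentence "subtract the log of the brightness times the number of occupied bins"; how the input is scaled and binned for each proposed brightness is left outside the function, and the name `compensated_log_score` is hypothetical.

```python
import math


def compensated_log_score(bin_log_probs, brightness):
    """Log score for one illuminant at one proposed brightness.

    bin_log_probs holds log P(bin | illuminant) for each occupied bin,
    looked up after the input has been scaled for the proposed brightness.
    The correction subtracts log(brightness) once per occupied bin, which
    corresponds to dividing each probability in the product by the
    brightness.
    """
    return sum(bin_log_probs) - len(bin_log_probs) * math.log(brightness)


# Two occupied bins, each with probability 0.1, at relative brightness 2.
score = compensated_log_score([math.log(0.1), math.log(0.1)], 2.0)
```

In this example the score equals the logarithm of (0.1 / 2) squared, i.e. log(0.0025).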

We now consider the choice of three-dimensional descriptors. One natural
choice is RGB. However, given the asymmetry of the role of brightness and
chromaticity in computational colour constancy, we feel that a better choice is to use
(r,g) chromaticity, together with R+G+B. This has several advantages over using
RGB. First, due to the above-mentioned asymmetry, we may wish to use different
resolutions for the chromaticity and the brightness. Second, this choice provides
conceptual clarity, in that our method then subsumes the two-dimensional version as
the sub-case where there is only one division for the R+G+B coordinate. Finally, we
find it convenient to have only one coordinate which can be arbitrarily large.
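A minimal sketch of this descriptor and its discretization is given below. The bin counts and the `max_brightness` bound are illustrative assumptions, not values taken from the paper, and the function names are hypothetical.

```python
def descriptor(rgb):
    """Map an (R, G, B) response to (r, g, R+G+B)."""
    R, G, B = rgb
    total = R + G + B
    if total <= 0:
        raise ValueError("pixel must have positive R+G+B")
    return (R / total, G / total, total)


def bin_index(desc, chroma_bins, bright_bins, max_brightness):
    """Discretize a descriptor with separate chromaticity and brightness
    resolutions. Values at or above the upper edge fall in the last bin."""
    r, g, total = desc
    ri = min(int(r * chroma_bins), chroma_bins - 1)
    gi = min(int(g * chroma_bins), chroma_bins - 1)
    bi = min(int(total / max_brightness * bright_bins), bright_bins - 1)
    return (ri, gi, bi)


d = descriptor((50, 25, 25))            # (0.5, 0.25, 100)
three_d = bin_index(d, 50, 10, 765.0)   # (25, 12, 1)
two_d = bin_index(d, 50, 1, 765.0)      # (25, 12, 0)
```

With a single brightness division, as in `two_d`, the third index is always 0 and the binning reduces to the two-dimensional chromaticity case.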

The algorithm as described is easily extended to model complex physical 
scenes. For example, we can model fluorescent surfaces, as already done in the two- 
dimensional case in [9], and we can model specular surfaces, including metallic ones, 
as was done for gamut-mapping algorithms in [8]. The Colour by Correlation method 
has an advantage over the gamut-mapping methods in that the expected frequency of 
occurrence of these phenomena can be modeled. Unfortunately we currently do not 
know these statistics for the real world, and hence it is difficult to exploit this in the 
case of image data. Nevertheless, doing so holds promise for the future because if 
some estimate of the likelihood of occurrence of these classes of surfaces could be 
made, then three-dimensional Colour by Correlation would be more robust than the
extended versions of three-dimensional gamut mapping. This is due to the fact that it 
can allow for the possibility of, for example, metallic surfaces, while compensating 
for the fact that there is only a low likelihood that such surfaces are present. Gamut- 
mapping, on the other hand, is forced to use uniform statistics. 

4 Algorithm Summary 

We now provide a summary of the method. The implementation of the algorithm 
consists of two parts. First the correlation matrices are built, and then these matrices 
are used to perform colour constancy. The first stage is a one time operation, and 
consequently, we are not concerned about resource usage. We begin with a data set of 
illuminant and reflectance spectra. Ideally, we would know the expected frequency of 
occurrence of these surfaces and illuminants, but since we do not, we assume that 
they are all equally likely. For surface reflectances we used a set of 1995 spectra 
compiled from several sources. These surfaces included the 24 Macbeth colour checker 
patches, 1269 Munsell chips, 120 Dupont paint chips, 170 natural objects, the 350 
surfaces in the Krinov data set [11], and 57 additional surfaces measured by ourselves. 

The choice of illuminant spectra must be made with more care, as the 
algorithms are sensitive to the statistics of the occurrence of the illuminants in the 
training set. We feel that it is best to have the training and testing sets both at least 
roughly uniformly distributed in (r,g) space. To obtain the appropriate illuminant 
sets, we first selected 11 sources to be used for the image data. These were selected to 
span the range of chromaticities of common natural and man-made illuminants as best 
as possible, while bearing in mind the other considerations of stability over time, 
spectral nearness to common illuminants, and physical suitability. To create the 
illuminant set used for training, we divided (r,g) space into cells 0.02 units wide, and 
placed the 11 illuminants described above into the appropriate cells. We then added 
illumination spectra from a second set of 97, provided that their chromaticity bins 
were not yet occupied. This second set consisted of additional sources, including a 
number of illumination spectra measured in and around our university campus. Then, 
to obtain the desired density of coverage, we used random linear combinations of 
spectra from the two sets. This is justified because illumination is often the blending 
of light from two or more sources. Finally, to produce the illuminant set for testing, 
we followed the same procedure, but filled the space 4 times more densely. 
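
The cell-filling step can be sketched as follows. This is a minimal illustration only, not our implementation: it assumes each illuminant is already represented by a camera RGB (the text works with spectra), and the names `rg_cell` and `fill_cells` are hypothetical. Keeping the first primary illuminant when two share a cell is also an assumption.

```python
CELL = 0.02  # cell width in (r,g) space, as stated in the text


def rg_cell(rgb, cell=CELL):
    """Return the integer (r,g) cell index of an RGB triple."""
    r, g, b = rgb
    total = r + g + b
    return (int((r / total) // cell), int((g / total) // cell))


def fill_cells(primary, secondary, cell=CELL):
    """Place the primary illuminants first, then add a secondary
    illuminant only if its (r,g) cell is not yet occupied."""
    occupied = {}
    for rgb in primary:
        # Assumption: if two primaries share a cell, keep the first.
        occupied.setdefault(rg_cell(rgb, cell), rgb)
    for rgb in secondary:
        key = rg_cell(rgb, cell)
        if key not in occupied:
            occupied[key] = rgb
    return occupied
```

Cells that remain empty after this step would then be filled with random linear combinations of the available spectra, as described above.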

Given the illuminant and reflectance data sets, we generate the required sensor 
responses using estimates of our camera sensitivity functions, determined as described 
in [12]. Thus to apply the algorithms to image data, we must first map the data into 
an appropriate linear space (also described in [12]), and perform other adjustments to 
compensate for the nature of the imaging process as described more fully in [7]. 

We use the colour space (r,g,L) where L=R+G+B, r=R/L, and g=G/L. We 
divide the space into discrete bins. The resolutions of the discretization of the three 
components do not need to be equal. There is no reason to make the first two different 
from each other, but, as discussed above, it can be advantageous to use a different 
value for the third. For all experiments we used 50 divisions for (r,g), which is 
consistent with the discretization resolution used in this work for two-dimensional 
Colour by Correlation. When specularities are added, as discussed shortly, the overall 
number of bins required for L increases. We express the resolution for L in terms of 
the number of bins devoted to matte reflection. For the experiments with generated 
data, we generally used a value for L which also leads to 50 divisions for matte 
reflection, but this is likely higher resolution than is necessary, and in fact, 
preliminary results indicate that a smaller number is likely better. Thus for the image 
data experiments, we used 25 divisions. 
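
As an illustration of the discretization, the following sketch maps RGB values to (r,g,L) bin indices. It is a sketch under stated assumptions: `l_max` (the L value treated as the top of the matte range) and the clamping of larger values into the last bin are our assumptions for the example, and the function name is hypothetical.

```python
import numpy as np


def rgl_bins(rgb, n_rg=50, n_l=25, l_max=1.0):
    """Map RGB triples (shape (..., 3)) to integer (r, g, L) bin indices.

    n_rg: divisions for r and g (50 in the text).
    n_l:  divisions for L (25 for the image data experiments).
    l_max: assumed L value at the top of the binned range.
    """
    rgb = np.asarray(rgb, dtype=float)
    L = rgb.sum(axis=-1)
    r = rgb[..., 0] / L
    g = rgb[..., 1] / L
    # Values on the upper boundary are clamped into the last bin.
    ri = np.minimum((r * n_rg).astype(int), n_rg - 1)
    gi = np.minimum((g * n_rg).astype(int), n_rg - 1)
    li = np.minimum((L / l_max * n_l).astype(int), n_l - 1)
    return ri, gi, li
```

When specularities are modelled, `n_l` and `l_max` would both be enlarged so that responses brighter than a perfect white still fall in a bin.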

Given a discretization of colour space, we then map this space into a vector, 
using any convenient method. We note that since half of the values in (r,g) are 
impossible, a more compact representation can be used than the naive one. Since the 
three-dimensional correlation matrices are large, we make use of this observation to 
reduce storage requirements. 
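
One convenient compact mapping, given here only as a sketch, exploits the fact that r+g cannot exceed 1, so only the lower triangle of the (r,g) grid is needed. The function names are hypothetical, and folding boundary colours (b=0) into the neighbouring valid cell is an assumption made for the example.

```python
def tri_size(n=50):
    """Number of (r,g) cells with ri + gi <= n - 1."""
    return n * (n + 1) // 2


def tri_index(ri, gi, n=50):
    """Index of cell (ri, gi) in a row-by-row listing of the triangle."""
    # Assumption: colours with b = 0 that land just outside the
    # triangle are folded into the nearest valid cell.
    gi = min(gi, n - 1 - ri)
    # Row ri starts after rows 0..ri-1, which hold n, n-1, ... cells.
    return ri * n - ri * (ri - 1) // 2 + gi


def flat_index(ri, gi, li, n_rg=50, n_l=25):
    """Position of bin (ri, gi, li) in the linearized colour space."""
    return tri_index(ri, gi, n_rg) * n_l + li
```

With 50 divisions for (r,g) and 25 for L this gives 1275 x 25 = 31875 entries per row, compared with 62500 for the naive layout.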

Thus we form a two-dimensional array, where each row is the above 
linearization of colour space, and the rows correspond to training illuminants. We 
then build up the matrix by computing, for each illuminant, the RGB of the 
reflectances in our database. We then compute the frequency of occurrence of the 
colours within each discrete cell in our colour space. These frequencies are 
proportional to the probabilities; they can be converted to probabilities by dividing by 
the total number of surfaces. Finally, for convenience, we store the logarithm of the 
probabilities. 
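
The construction of the matrix can be sketched as below. The sketch assumes that the bin index of every surface under every training illuminant has already been computed; the small floor used for empty bins (to avoid taking the logarithm of zero) is our assumption for the example and is not specified in the text.

```python
import numpy as np


def build_log_matrix(bin_ids_per_illum, n_bins, floor=1e-6):
    """Build the correlation matrix of log probabilities.

    bin_ids_per_illum: one sequence per training illuminant, giving
        the linearized bin index of each surface under that illuminant.
    floor: assumed probability assigned to bins with no surfaces.
    """
    matrix = np.empty((len(bin_ids_per_illum), n_bins))
    for i, ids in enumerate(bin_ids_per_illum):
        counts = np.bincount(np.asarray(ids, dtype=int), minlength=n_bins)
        probs = counts / counts.sum()       # frequencies -> probabilities
        matrix[i] = np.log(np.maximum(probs, floor))
    return matrix
```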

To add fluorescent surfaces, we compute the responses which occur for each 
illuminant using the model described in [9]. The relative expected frequency of such 
surfaces is expressed by simply adjusting the frequency counts during the construction 
of the correlation matrix. In our experiments with fluorescent surfaces, we set the 
frequency of occurrence of any fluorescent surface to be about 20%. Since we only 
model 9 such surfaces, the frequency of occurrence of each was set to be 50 times that 
of each of the surfaces in the set of roughly 2000 reflectances. 
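
The arithmetic behind these figures can be checked with a short sketch. The function names are hypothetical; the default values are the ones quoted in the text (9 fluorescent surfaces, weight 50, 1995 reflectances).

```python
import numpy as np


def weighted_counts(matte_ids, fluor_ids, n_bins, fluor_weight=50.0):
    """Frequency counts in which each fluorescent surface counts
    fluor_weight times as much as each ordinary reflectance."""
    counts = np.bincount(np.asarray(matte_ids, dtype=int),
                         minlength=n_bins).astype(float)
    counts += fluor_weight * np.bincount(np.asarray(fluor_ids, dtype=int),
                                         minlength=n_bins)
    return counts


def fluorescent_share(n_matte=1995, n_fluor=9, fluor_weight=50.0):
    """Fraction of the total weight carried by fluorescent surfaces."""
    w = n_fluor * fluor_weight
    return w / (w + n_matte)
```

With the default values the share is 450/2445, or roughly 18%, which matches the "about 20%" quoted above.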

We can also model specular reflection. This is a little more involved than 
handling fluorescent surfaces. First, we need to extend the number of bins in the L 
direction, as specular reflection is modeled as reflection which exceeds that of a perfect 
white. Then, we must model both the relative frequency of occurrence of specularities, 
as well as the frequency of each degree of specular reflection. It should be clear that the 
model can equally be used with metallic specularities, in analogy with the work in [8], 
but we do not study those here. 

The second part of the algorithm is the use of the above matrix for colour 
constancy. We wish to compute the likelihood of an illuminant-brightness 
combination. We loop over the possible illuminants, and then the possible 
brightnesses, to obtain an estimate for each combination. To compute a maximum 
likelihood estimate, we simply keep track of the maximum value reached and the 
corresponding illuminant and brightness. However, since we are also interested in 
studying the mean likelihood estimate, we store all values in order to make that 
estimate from them as a second step. We now provide additional details of the 
likelihood calculation. 
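
Given a stored grid of log likelihoods, the two estimates can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, and forming the mean estimate by summing the likelihood over brightness before averaging the illuminant chromaticities is our assumption for the example.

```python
import numpy as np


def map_and_mean(loglik, illum_rg):
    """Return the maximum likelihood (illuminant, brightness) indices
    and the mean likelihood (r,g) estimate.

    loglik:   array [n_illuminants, n_brightness] of log likelihoods.
    illum_rg: array [n_illuminants, 2] of illuminant chromaticities.
    """
    loglik = np.asarray(loglik, dtype=float)
    illum_rg = np.asarray(illum_rg, dtype=float)
    i, j = np.unravel_index(np.argmax(loglik), loglik.shape)
    # Subtract the maximum before exponentiating for numerical safety.
    w = np.exp(loglik - loglik.max())
    w_illum = w.sum(axis=1)                 # assumed: sum over brightness
    mean_rg = (w_illum[:, None] * illum_rg).sum(axis=0) / w_illum.sum()
    return (int(i), int(j)), mean_rg
```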

Again, for each proposed illuminant, we loop over a discretization of 
possible brightnesses on a log scale. We remind the reader that the range is set by an 
initial rough estimate of the brightness. We generally use 101 brightness levels; this 
is likely excessive. For each proposed brightness level, we scale the input 
accordingly, using the brightness of the proposed illuminant. We then form a vector 
representing the observed scene assuming this brightness level. The components of 
this vector correspond to the linearized form of the discretized colour space. 
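
A log-scale set of candidate brightness levels can be generated as in the sketch below. The width of the range around the rough estimate (`span`) is purely an assumption for illustration; only the count of 101 levels comes from the text.

```python
import numpy as np


def brightness_levels(rough_estimate, span=4.0, n_levels=101):
    """Candidate brightnesses, evenly spaced on a log scale, from
    rough_estimate / span to rough_estimate * span (span is assumed)."""
    return np.geomspace(rough_estimate / span, rough_estimate * span, n_levels)
```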

To compute the entries of this vector we begin by initializing all 
components to zero. We then compute the corresponding bin for each colour observed 
in the scene. If we are using the first method to solve the discretization problem 
discussed in the previous section, then we store a count of the number of colours 
falling in each bin. Alternatively, if we are using the second method we simply note 
the presence of the colour with a count of one. All bins corresponding to colours not 
observed remain zero. 

To obtain the likelihood of the proposed illuminant-brightness combination, 
we simply take the dot product of the computed vector with the row in the correlation 
matrix corresponding to the proposed illuminant. Since the values stored in the 
correlation matrix are the logarithms of probabilities, the dot product computes the 
logarithm of the product of the probability contributions for each observation (see 
Equation 5). If we are using the second method to compensate for the discretization 
problem discussed above, we then adjust the result by subtracting the logarithm of the 
proposed brightness times the count of the occupied bins. 
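
The likelihood computation for one illuminant-brightness combination can be sketched as follows. The function names are hypothetical, and the sketch assumes that the observed colours have already been scaled by the proposed brightness and mapped to linearized bin indices.

```python
import numpy as np


def scene_vector(bin_ids, n_bins, presence_only=False):
    """Vector over the linearized colour space for the observed scene.

    presence_only=False: count of colours per bin (first method).
    presence_only=True:  1 for each occupied bin (second method).
    """
    v = np.bincount(np.asarray(bin_ids, dtype=int),
                    minlength=n_bins).astype(float)
    if presence_only:
        v = np.minimum(v, 1.0)
    return v


def log_likelihood(v, log_row, brightness=1.0, presence_only=False):
    """Dot product of the scene vector with the illuminant's row of
    log probabilities, with the brightness adjustment of the second
    method applied when presence_only is set."""
    ll = float(np.dot(v, log_row))
    if presence_only:
        # Subtract log(brightness) times the number of occupied bins.
        ll -= np.log(brightness) * v.sum()
    return ll
```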

5 Experiments 

We tested the new algorithm on generated and image data. For the first two sets of 
results on generated data we used the first method of dealing with the discretization 
problem. For the third set of results with generated data, as well as for the image data 
results, we used the second method. For the experiments with generated data we used 
the set of test illuminants described above. We remind the reader that both the training 
illuminant set and the test illuminant set were designed to systematically cover (r,g) 
space, but the test illuminant set covered that space four times more densely. 

Figure 2 shows the chromaticity performance of the method using both 
maximum likelihood and mean likelihood estimation as a function of the number of 
surfaces in the generated scenes. We also provide the results for corresponding two- 
dimensional versions of the algorithms, as well as the results for two gamut mapping 
methods: the original method [2], labeled CRULE-MV in the results, and a new 
variant introduced in [7], and labeled as ND-ECRULE-SCWIA-12 to maintain 
consistency with the literature. This latter algorithm has been shown to be comparable 
to the best computational colour constancy algorithms over a wide range of 
conditions, and thus provides a good counter-point to the three-dimensional version of 
Colour by Correlation. 

The results clearly show that the new method excels when tested on data with 
similar statistics to that used for training. The error drops to the minimum possible 
given the discretization when only 16 surfaces are used, clearly out-performing the 
other algorithms. 

For the second experiment we looked at the performance of the method under 
a variety of simulated conditions. We developed three-dimensional Colour by 
Correlation algorithms for fluorescent surfaces and specular reflection, and tested these 
algorithms, along with the version for matte surfaces, under the conditions of matte 
surfaces, matte and fluorescent surfaces, and matte surfaces with specularities. The 
test conditions were similar to the training conditions, especially in the fluorescent 
case. In the specular case, the rough discretization of specular reflection used for 
creating the correlation matrices only approximates what the algorithms were tested 
against. 

Again the results, shown in Table 2, are very promising. As expected, the 
algorithms do very well when tested under the conditions they were designed for. More 
promising is that the algorithms seem quite robust to the absence of these conditions. 
For example, adding fluorescent capability reduced the error from 0.060 to 0.022 when 
fluorescent surfaces were present, but using the algorithm with fluorescent capability 
in the case of standard surfaces incurred minimal penalty (0.026 instead of 0.025). 
(These figures are using the MMSE estimator). In general, it is clear that for generated 
data, these algorithms perform better than any of the others which are listed in 
Table 2. 

For the third experiment, we tested the second method of dealing with the 
discretization problem discussed above. The results are shown in Table 3. We also 
include additional comparison algorithms in this table. Again, the new methods do 
significantly better than the next best strategy, which is the ND-ECRULE-SCWIA-12 
algorithm developed in [7]. Using the MMSE estimator, the three-dimensional Colour 
by Correlation error is 0.024; using the MAP estimator it is 0.029; and using 
ND-ECRULE-SCWIA-12 it is 0.039. The results also indicate that the second method of 
dealing with our discretization problem may be better than the first, as the errors are 
lower, but we note that the difference can also easily be explained by random 
fluctuations within our error estimates. 

We also tested the method on a data set of 321 images. These images were of 
30 scenes under 11 different illuminants (9 were culled due to problems). The images 
are described in more detail in [7]. As mentioned above, we have not yet been able to 
significantly improve upon the two-dimensional method using the first method of 
dealing with our discretization problem. Using the second method, however, the 
results, shown in Table 4, are very promising. We see that the error of the new 
method (0.046) is approaching that of the best performers listed, namely 
ECRULE-SCWIA-12 (0.037) and CRULE-MV (0.045). This error is significantly less 
than that for the two-dimensional counter-part (0.077). 

6 Conclusion 

We have shown how to modify the Colour by Correlation algorithm to work 
in a three-dimensional colour space. This was motivated by the observations that the 
correlation method is more powerful than the chromaticity gamut-mapping method 
due to the use of statistical information, and that three-dimensional gamut mapping is 
also more effective than its chromaticity counterpart due to the use of information 
inherent in the pixel brightness. We wished to combine these two features into one 
algorithm. The resulting algorithm is also suitable for modification to deal with 
complex physical surfaces such as fluorescence, and standard and metallic 
specularities. In fact, if the frequency of occurrence of these surfaces could be 
estimated, then this algorithm could also exploit these statistics. In summary, this 
algorithm is able to use more sources of information than any other, and thus is 
potentially the most powerful colour constancy method. 

We tested a number of versions of the algorithm on synthetic and image data. 
The results with synthetic data are excellent, and it seems that these algorithms are in 
fact the best performers in this situation. The results with image data are also 
promising. In this case the new methods perform significantly better than their 
two-dimensional counterparts. Currently, however, the performance still lags a little 
behind the best algorithms for image data. It is quite possible that the performance 
gap between synthetic and image data can be reduced, as we have only recently begun to 
study the algorithm in this context. However, previous work [7] has shown that 
statistical algorithms do tend to shine during synthetic testing, and therefore, we must 
be cautious not to over-sell the method until the image data performance exceeds that 
of the current best methods. 

Finally we note that the algorithm as described is computationally quite 
expensive, both in terms of memory use, and CPU time. Since our initial intention 
was to push the limits of the error performance, we have not addressed ways to speed 
up the algorithm. If the performance on image data can be made comparable to that for 
generated data, then an important next step is to consider what can be done in this 
regard. 

[Figure 2 plot; horizontal axis: logarithm of the number of generated (R,G,B).] 

Fig. 2: The chromaticity performance of the new method as compared to the two- 
dimensional version of the algorithm and two gamut mapping methods. For both 
Colour by Correlation methods we provide results using both maximum likelihood 
(MAP) and mean likelihood (MMSE) estimation. 

NOTHING 
    The result of doing no colour constancy processing. 

AVE 
    The illuminant is assumed to be the average of the illuminant data 
    base (normalized), regardless of the input. 

MAX 
    Estimate illuminant by the max RGB in each channel. 

GW 
    Estimate illuminant colour by assuming that image average is the 
    colour of a 50% reflectance. 

DB-GW 
    Estimate illuminant colour by assuming that image average is the 
    colour of the average of a reflectance database. 

CRULE 
    Original gamut constraint method described in 

ECRULE 
    CRULE with illumination constraint. 

MV 
    Solutions are chosen from the feasible set delivered by the gamut 
    mapping method using the max volume heuristic. 

ICA 
    Solutions are chosen from the feasible set delivered by the gamut 
    mapping method by averaging. 

SCWIA 
    Solutions are the average over feasible illuminant chromaticities, 
    weighted by a function chosen to emphasize illuminants with 
    chromaticities around the maximum volume solution, as described 
    in [7]. 

ND 
    Gamut mapping algorithm is extended to reduce diagonal model 
    failure as described in [7, 9]. 

C-by-C-MAP 
    Colour by Correlation [1], with a Gaussian mask to smooth the 
    correlation matrix and maximum likelihood estimate. 

C-by-C-MMSE 
    Colour by Correlation [1], with a Gaussian mask to smooth the 
    correlation matrix and mean likelihood estimate. 

3D-C-by-C-MAP 
    The Colour by Correlation method for a three-dimensional colour 
    space as developed in this paper, and using the maximum 
    likelihood estimate. 

3D-C-by-C-MMSE 
    The Colour by Correlation method for a three-dimensional colour 
    space as developed in this paper, and using the mean likelihood 
    estimate. 

FL 
    Algorithm is extended for fluorescence. For gamut mapping and 
    two-dimensional Colour by Correlation, this is described in [9]. 
    For 3D-C-by-C, the algorithm is developed in this work. 

SPEC 
    Algorithm is extended for specular reflection. For gamut mapping 
    this is described in [8]. For 3D-C-by-C, the algorithm is developed 
    in this work. 

Table 1: Key to the algorithms referred to in the results. 

                       Synthetic scenes with 8 surfaces 
Algorithm              Matte     Matte and      Matte and 
                                 fluorescent    specular 
NOTHING                0.116     0.114          0.110 
AVE-ILLUM              0.088     0.086          0.084 
GW                     0.057     0.116          0.034 
DB-GW                  0.047     0.092          0.032 
MAX                    0.066     0.104          0.033 
C-by-C-01              0.078     0.071          0.079 
C-by-C-MAP             0.044     0.059          0.040 
C-by-C-MMSE            0.037     0.048          0.033 
3D-C-by-C-MAP          0.030     0.066          0.043 
3D-C-by-C-MMSE         0.025     0.060          0.037 
FL-3D-C-by-C-MAP       0.033     0.023          * 
FL-3D-C-by-C-MMSE      0.026     0.022          * 
SPEC-3D-C-by-C-MAP     0.038     *              0.023 
SPEC-3D-C-by-C-MMSE    0.032     *              0.017 
CRULE-MV               0.050     0.103          0.027 
CRULE-AVE              0.061     0.088          0.052 
ECRULE-MV              0.045     0.078          0.026 
ECRULE-ICA             0.051     0.065          0.045 
FL-ECRULE-MV           0.049     0.061          0.027 
FL-ECRULE-ICA          0.058     0.051          0.056 
SP-ND-ECRULE-MV        0.053     0.085          0.029 
SP-ND-ECRULE-ICA       0.047     0.062          0.026 

Table 2: Algorithm chromaticity performance under three different conditions of 
variants of the new methods designed for the various conditions, as well as that for a 
number of comparison algorithms. For these results, the first method of dealing with 
our discretization problem was used. 

Algorithm              Performance estimating (r,g) 
                       chromaticity of the illuminant (±4%) 
NOTHING                0.111 
AVE-ILLUM              0.083 
GW                     0.055 
DB-GW                  0.047 
MAX                    0.061 
C-by-C-01              0.076 
C-by-C-MAP             0.043 
C-by-C-MMSE            0.035 
3D-C-by-C-MAP          0.029 
3D-C-by-C-MMSE         0.024 
CRULE-MV               0.048 
ND-ECRULE-SCWIA-12     0.039 

Table 3: Algorithm chromaticity performance in the Mondrian world of the new 
method (MAP and MMSE), as well as that for a number of comparison algorithms. 
For these results, the second method of dealing with our discretization problem was 
used. 



Algorithm              Performance estimating (r,g) 
                       chromaticity of the illuminant (±4%) 
NOTHING                0.125 
AVE-ILLUM              0.094 
GW                     0.106 
DB-GW                  0.088 
MAX                    0.062 
C-by-C-01              0.075 
C-by-C-MAP             0.084 
C-by-C-MMSE            0.077 
3D-C-by-C-MAP          0.047 
3D-C-by-C-MMSE         0.046 
CRULE-MV               0.045 
ECRULE-SCWIA-12        0.037 

Table 4: Algorithm chromaticity performance on 321 images of the new method 
with two estimators (MAP and MMSE), as well as that for a number of comparison 
algorithms. For these results, the second method of dealing with our discretization 
problem was used. 


Abstract. In this paper we introduce two improvements to the three- 
dimensional gamut mapping approach to computational colour constancy. 
This approach consists of two separate parts. First the possible solutions 
are constrained. This part is dependent on the diagonal model of 
illumination change, which in turn, is a function of the camera sensors. In 
this work we propose a robust method for relaxing this reliance on the 
diagonal model. The second part of the gamut mapping paradigm is to 
choose a solution from the feasible set. Currently there are two general 
approaches for doing so. We propose a hybrid method which embodies the 
benefits of both, and generally performs better than either. We provide 
results using both generated data and a carefully calibrated set of 321 
images. In the case of the modification for diagonal model failure, we 
provide synthetic results using two cameras with a distinctly different 
degree of support for the diagonal model. Here we verify that the new 
method does indeed reduce error due to the diagonal model. We also verify 
that the new method for choosing the solution offers significant 
improvement, both in the case of synthetic data and with real images. 



1 Introduction 

The image recorded by a camera depends on three factors: The physical content of the 
scene, the illumination incident on the scene, and the characteristics of the camera. 
This leads to a problem for many applications where the main interest is in the 
physical content of the scene. Consider, for example, a computer vision application 
which identifies objects by colour. If the colours of the objects in a database are 
specified for tungsten illumination (reddish), then object recognition can fail when the 
system is used under the very blue illumination of blue sky. This is because the 
change in the illumination affects object colours far beyond the tolerance required for 
reasonable object recognition. Thus the illumination must be controlled, determined, 
or otherwise taken into account. 

Compensating for the unknown illuminant in a computer vision context is 
the computational colour constancy problem. Previous work has suggested that some 
of the most promising methods for solving this problem are the three-dimensional 
gamut constraint algorithms [1-4]. In this paper we propose two methods for further 
improving their efficacy. 

The gamut mapping algorithms consist of two stages. First, the set of 
possible solutions is constrained. Then a solution is chosen from the resulting 
feasible set. We propose improvements to each of these two stages. To improve the 
construction of the solution set, we introduce a method to reduce the error arising 
from diagonal model failure. This method is thus a robust alternative to the sensor 
sharpening paradigm [2, 4-6]. For example, unlike sensor sharpening, this method is 
applicable to the extreme diagonal model failures inherent with fluorescent surfaces. 
(Such surfaces are considered in the context of computational colour constancy in [7]). 

To improve solution selection, we begin with an analysis of the two current 
approaches, namely averaging and the maximum volume heuristic. These methods are 
both attractive; the one which is preferred depends on the error measure, the number of 
surfaces, as well as other factors. Thus we propose a hybrid method which is easily 
adjustable to be more like the one method or the other. Importantly, the combined 
method usually gives a better solution than either of the two basic methods. 
Furthermore, we found that it was relatively easy to find a degree of hybridization 
which improves gamut mapping colour constancy in the circumstances of most 
interest. We will now describe the two modifications in more detail, beginning with 
the method to reduce the reliance on the diagonal model. 

2 Diminishing Diagonal Model Error 

We will begin with a brief review of Forsyth’s gamut mapping method [1]. First we 
form the set of all possible RGB due to surfaces in the world under a known, 
“canonical” illuminant. This set is convex and is represented by its convex hull. The 
set of all possible RGB under the unknown illuminant is similarly represented by its 
convex hull. Under the diagonal assumption of illumination change, these two hulls 
are a unique diagonal mapping (a simple 3D stretch) of each other. 

Figure 1 illustrates the situation using triangles to represent the gamuts. In 
the full RGB version of the algorithm, the gamuts are actually three dimensional 
polytopes. The upper thicker triangle represents the unknown gamut of the possible 
sensor responses under the unknown illuminant, and the lower thicker triangle 
represents the known gamut of sensor responses under the canonical illuminant. We 
seek the mapping between the sets, but since the one set is not known, we estimate it 
by the observed sensor responses, which form a subset, illustrated by the thinner 
triangle. Because the observed set is normally a proper subset, the mapping to the 
canonical is not unique, and Forsyth provides a method for effectively computing the 
set of possible diagonal maps. (See [1, 2, 4, 8-10] for more details on gamut 
mapping algorithms). 




We now consider the case where the diagonal model is less appropriate. Here 
it may be possible that an observed set of illuminants does not map into the canonical 
set with a single diagonal transform. This corresponds to an empty solution set. In 
earlier work we forced a solution by assuming that such null intersections were due to 
measurement error, and various error estimates were increased until a solution was 
found. However, this method does not give very good results in the case of extreme 
diagonal failures, such as those due to fluorescent surfaces. The problem is that the 
constraints were relaxed indiscriminately, rather than in concordance with the world. 
Similarly, even if the solution set is not null, if there is extreme diagonal failure, 
then the solution set may not be as appropriate for selecting the best solution (by 
averaging, say) as a set which is larger, but more faithful to the true possibilities. 

To deal with diagonal model failure, we propose the following modification: 
Consider the gamut of possible RGB under a single test illuminant. Call this the test 
illuminant gamut. Now consider the diagonal map which takes the RGB for white 
under the test illuminant to the RGB for white under the canonical illuminant. If we 
apply that diagonal map to our test illuminant gamut, then we will get a convex set 
similar to the canonical gamut, the degree of difference reflecting the failure of the 
diagonal model. If we extend the canonical gamut to include this mapping of the test 
set, then there will always be a diagonal mapping from the observed RGB of scenes 
under the test illuminant to the canonical gamut. We repeat this procedure over a 
representative set of illuminants to produce a canonical gamut which is applicable to 
those illuminants as well as any convex combination of them. The basic idea is 
illustrated in Figure 2. 

[Figure 2 panels: (top) The gamuts of all possible RGB under three training 
illuminants. (middle) Mapped sets to canonical based on white. The maps are not all 
the same due to diagonal model failure. (bottom) Extended canonical gamut is the 
convex hull of the union of mapped sets based on white, using a collection of 
representative training illuminants.] 

Fig. 2: Illustration of the modification to the gamut mapping method to reduce 
diagonal model failure. 
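
The construction of the extended canonical gamut can be sketched as follows. This is a minimal illustration with hypothetical function names. It returns the pooled point set only; the extended gamut is the convex hull of these points, which could be computed with any convex hull routine.

```python
import numpy as np


def white_map(white_test, white_canon):
    """Diagonal map taking white under the test illuminant to white
    under the canonical illuminant."""
    return np.asarray(white_canon, float) / np.asarray(white_test, float)


def extended_canonical_points(canon_rgbs, test_gamuts, test_whites,
                              white_canon):
    """Pool the canonical gamut points with every test illuminant
    gamut after mapping it by its white-based diagonal map."""
    pieces = [np.asarray(canon_rgbs, float)]
    for rgbs, white in zip(test_gamuts, test_whites):
        d = white_map(white, white_canon)
        pieces.append(np.asarray(rgbs, float) * d)  # element-wise scaling
    return np.vstack(pieces)
```

If the diagonal model held exactly, every mapped set would coincide with the canonical gamut and the extension would add nothing.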

3 Improving Solution Choice 

Once a constraint set has been found, the second stage of the gamut mapping method 
is to select an appropriate solution from the constraint set. Two general methods have 
been used to do this. First, following Forsyth [1], we can select the mapping which 
maximizes the volume of the mapped set. Second, as proposed by Barnard [2], we can 
use the average of the possible maps. When Finlayson's illumination constraint is 
used, then the set of possible maps is non-convex. In [2], averaging was simplified by 
using the convex hull of the illuminant constraint. In [10] Monte Carlo integration 
was used in conjunction with the two-dimensional version of the algorithm, and in [4, 
chapter 4] the average was estimated by numerical integration. 

In [4, chapter 4], we found that both averaging and the maximum volume 
method have appeal. We found that the preferred method was largely a function of the 
error measure, with other factors such as the diversity of scene surfaces also playing a 
role. When the scene RGB mapping error measure is used, the average of the possible 
maps is a very good choice. In fact, if we are otherwise completely ignorant about the 
map, then it is the best choice in terms of least squares. 

On the other hand, if we use an illumination estimation measure, then the 
original maximum volume heuristic is often the best choice. This is important 
because we are frequently most interested in correcting for the mismatch between the 
chromaticity of the unknown illuminant and the canonical illuminant. In this case, 
the errors based on the chromaticity of the estimated scene illuminant correlate best 
with our goal, and the maximum volume heuristic tends to give the best results. 

In this work we will focus on estimating the chromaticity of the illuminant. 
Despite the success of the maximum volume heuristic, we intuitively feel that, at 
least in some circumstances, some form of averaging should give a more robust 
estimate. This intuition is strengthened by the observation that when we go from 
synthetic to real data, the maximum volume method loses ground to averaging (see, 
for example, [4, Figure 4.12]). 

We begin our analysis by considering solution selection by averaging. Here 
we will assume that we are ignorant of the prior likelihood of the various 
possibilities, and thus averaging the possibilities corresponds to integrating 
corresponding volumes. The algorithms published so far integrate in the space of 
diagonal maps, which is not quite the same as the space of illuminants. Under the 
diagonal model, the illuminant RGB is proportional to the element-wise reciprocal of 
the diagonal maps. Thus we see that for an illumination oriented error measure, we 
may be averaging in the wrong space, as intuitively, we want to average possible 
illuminants. 
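
The difference between the two averaging spaces can be seen in a small sketch. The function name is hypothetical, and normalizing the canonical white to (1,1,1) by default is an assumption made for the example.

```python
import numpy as np


def illuminant_rg(d, white_canon=(1.0, 1.0, 1.0)):
    """(r,g) chromaticity of the illuminant implied by diagonal map d,
    using illuminant RGB proportional to white_canon / d."""
    illum = np.asarray(white_canon, float) / np.asarray(d, float)
    return illum[:2] / illum.sum()


maps = np.array([[1.0, 1.0, 1.0], [0.25, 1.0, 1.0]])
# Chromaticity of the average map versus average of the chromaticities.
rg_of_mean_map = illuminant_rg(maps.mean(axis=0))
mean_of_rg = np.mean([illuminant_rg(d) for d in maps], axis=0)
```

For these two maps the first quantity has r of about 0.44 and the second has r equal to 0.5, so the two averages do not agree.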

However, averaging the possible illuminants has some difficulties. As we go 
towards the origin in the space of possible diagonal maps, the corresponding proposed 
illuminant becomes infinitely bright. The origin is included in the constraint set 
because we assume that surfaces can be arbitrarily dark. Although it is rare for a 
physical surface to have a reflectivity of less than 3%, surfaces can behave as though 
they are arbitrarily dark due to shading. Thus we always maintain the possibility that 
the illuminant is very bright. Specifically, if (R,G,B) is a possible illuminant colour, 
then (kR,kG,kB) is also a possible illuminant for all k>1. Put differently, a priori the 
set of RGB of all possible illuminants is considered to be a cone in illuminant RGB 
space [11]. When we add the surface constraints, then the cone becomes truncated. As 
soon as we see anything but black, we know that the origin is excluded, and specific 
observed sensor responses lead to specific slices being taken out of the cone. 

The above discussion underscores the idea that when we average illuminants, 
we should ignore magnitude. However, since the work presented in [4] demonstrates 
that the three-dimensional algorithms outperform their chromaticity counterparts, we 
do not want to completely throw away the brightness information. Considering the 
truncated cone again, we posit that the nature of the truncations matters. The problem 
is then how to average the possible illuminants. 

Consider, for a moment, the success of the three-dimensional gamut 
mapping algorithms. In the space of maps, each direction corresponds to an illuminant 
chromaticity. Loosely speaking, the chromaticity implied by an RGB solution, 
chosen in some manner, is the average of the possible chromaticities, weighted by an 
appropriate function. For example, the maximum volume estimate simply puts all 
the weight in the direction of the maximum coordinate product. Similarly, the average 
estimate weights the chromaticities by the volume of the cone in the corresponding 
direction. 

Given this analogy, we can consider alternative methods of choosing a 
chromaticity solution. Since the maximum volume method tends to give better 
chromaticity estimates, especially when specularities are present, we wish to consider 
averages which put the bulk of the weight on solutions near the maximum volume 
direction. Now, one possible outcome of doing so would be the discovery that the 
maximum volume weighting worked the best. Interestingly, this proved not to be the 
case. Specifically we were able to find compromises which worked better. 

We now present the weighting function developed for this work. Consider 
the solution set in mapping space. Then, each illuminant direction intersects the 
possible solution set at the origin, and at some other point. For an illuminant 
direction, i, let that other point be $(d_r^{(i)}, d_g^{(i)}, d_b^{(i)})$. Then, the functions we use to 
moderate the above weighting are powers of the geometric mean of the coordinates of that 
mapping. Formally, we have parameterized functions $f_N$ given by: 

    $f_N(i) = (d_r^{(i)} d_g^{(i)} d_b^{(i)})^{N/3}$    (1)

We note that the solution provided by the average of the mappings is roughly $f_3$. The 
correspondence is not exact because the averaging is done over illuminant directions, 
not mapping directions. Similarly, as N becomes very large, the new method should 
approach the maximum volume method. 

In order to use the above weighting function, we integrate numerically in 
polar coordinates. We discretize the polar coordinates of the illuminant directions 
inside a rectangular cone bounding the possible illuminant directions. We then test 
each illuminant direction as to whether it is a possible solution given the surface and 
illumination constraints. If it is, we compute the weighting function, and further 
multiply the result by the polar coordinate foreshortening, $\sin(\phi)$, where $\phi$ is the polar angle. We sum the 
results over the possible directions, and divide the total by the total weight to obtain 
the weighted average. 
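
The integration loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the callback `boundary_point` and the toy `box_boundary` feasible set are hypothetical stand-ins for the real surface and illumination constraints, and the weight is taken to be the product of the boundary map coordinates raised to the power N/3.

```python
import numpy as np

def weighted_illuminant_direction(boundary_point, n_power, n_theta=90, n_phi=90):
    """Weighted average of illuminant directions over the positive octant.

    boundary_point(u) is a hypothetical callback: for a unit illuminant
    direction u it returns the far intersection (d_r, d_g, d_b) of the
    corresponding map direction with the feasible set, or None when the
    direction is not a possible solution.
    """
    thetas = (np.arange(n_theta) + 0.5) * (np.pi / 2) / n_theta  # azimuth
    phis = (np.arange(n_phi) + 0.5) * (np.pi / 2) / n_phi        # polar angle
    total = np.zeros(3)
    weight_sum = 0.0
    for phi in phis:
        for theta in thetas:
            u = np.array([np.sin(phi) * np.cos(theta),
                          np.sin(phi) * np.sin(theta),
                          np.cos(phi)])
            d = boundary_point(u)
            if d is None:
                continue  # direction excluded by the constraints
            # Power of the geometric mean, times the polar foreshortening.
            w = np.prod(d) ** (n_power / 3.0) * np.sin(phi)
            total += w * u
            weight_sum += w
    return total / weight_sum

def box_boundary(u, box=(1.0, 1.0, 1.0)):
    """Toy feasible set (assumption): maps inside an axis-aligned box,
    so every direction in the octant is feasible."""
    m = 1.0 / u                      # map direction for illuminant direction u
    t = np.min(np.asarray(box) / m)  # largest step that stays inside the box
    return t * m

estimate = weighted_illuminant_direction(box_boundary, n_power=12)
```

For the symmetric toy box the estimate points close to the grey direction, as expected; a real feasibility test would return None for excluded directions.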

Finally we note that the above numerical integration, while clearly more 
computationally intensive than some of the precursor algorithms, still takes only a 
few seconds on a modern workstation (even when using a very conservative 
discretization volume). 

4 Experiments 

We first consider the results for the method introduced to deal with diagonal 
model failure. Since the efficacy of the diagonal model is known to be a function of 
the camera sensors [1, 4, 5, 8, 12, 13], we provide results for two cameras with 
distinctly different degrees of support for the diagonal model. Our Sony DXC-930 
video camera has quite sharp sensors, and with this camera, the changes in sensor 
responses to illumination changes can normally be well approximated with the 
diagonal model. On the other hand, the Kodak DCS-200 digital camera [14] has less 
sharp sensors, and the diagonal model is less appropriate [6]. 

In the first experiment, we generated synthetic scenes with 4, 8, 16, 32, 64, 
128, 256, 512, and 1024 surfaces. For each number of surfaces, we generated 1000 
scenes with the surfaces randomly selected from the reflectance database and a 
randomly selected illuminant from the test illuminant database. For surface 
reflectances we used a set of 1995 spectra compiled from several sources. These 
surfaces included the 24 Macbeth colour checker patches, 1269 Munsell chips, 120 
Dupont paint chips, 170 natural objects, the 350 surfaces in the Krinov data set [15], and 
57 additional surfaces measured by ourselves. The illuminant spectra set was 
constructed from a number of measured illuminants, augmented where necessary with 
random linear combinations, in order to have a set which was roughly uniform in (r,g) 
space. This data set is described in more detail in [4, chapter 4]. 

For each algorithm and number of scenes we computed the RMS of the 1000 
results. We choose RMS over the average because, on the assumption of roughly 
normally distributed errors with mean zero, the RMS gives us an estimate of the 
standard deviation of the algorithm estimates around the target. This is preferable to 
using the average of the magnitude of the errors, as those values are not normally 
distributed. Finally, given normal statistics, we can estimate the relative error in the 
RMS estimate by $1/\sqrt{2N}$ [16, p. 269]. For N=1000, this is roughly 2%. 
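
The RMS statistic and its relative error can be checked with a short sketch. The function names are illustrative, and the error formula assumes normally distributed errors as stated above.

```python
import math

def rms(errors):
    """Root mean square of a sequence of error values."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def rms_relative_error(n):
    """Relative error of an RMS estimate from n samples, assuming
    normally distributed errors: 1 / sqrt(2n)."""
    return 1.0 / math.sqrt(2.0 * n)

error_1000 = rms_relative_error(1000)  # synthetic scenes per condition
error_321 = rms_relative_error(321)    # calibrated test images
```

With 1000 scenes the relative error is a little over 2%, and with 321 images it is close to 4%, consistent with the figures quoted for the plots and tables.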

For each generated scene we computed the results of the various algorithms. 
We considered three-dimensional gamut mapping, with and without Finlayson's 
illumination constraint [9]. We will label the versions without the illumination 
constraint by CRULE, which is adopted from [1]. When the illumination constraint is 
added, we use the label ECRULE instead (Extended-CRULE). Solution selection 
using the maximum volume heuristic is identified by the suffix MV. For averaging in 
the case of CRULE, we use the suffix AVE, and in the case of ECRULE, we use the 
suffix ICA, indicating that the average was over the non-convex set (Illumination- 
Constrained- Average). This gives a total of four algorithms: CRULE-MV, CRULE- 
AVE, ECRULE-MV, and ECRULE-ICA. Finally, the method described above to 
reduce diagonal model failure will be indicated by the prefix ND (Non-Diagonal). We 
test this method in conjunction with each of the four previous algorithms, for a total 
of eight algorithms. We report the distance in (r,g) chromaticity space between the 
scene illuminant and the estimate thereof. 
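
A minimal sketch of this error measure follows. It assumes the usual definition r = R/(R+G+B), g = G/(R+G+B) and a Euclidean distance, neither of which is spelled out in this section; the function names are illustrative.

```python
import math

def rg_chromaticity(rgb):
    """Map an RGB triple to (r, g) chromaticity (assumed definition)."""
    r, g, b = rgb
    total = r + g + b
    return (r / total, g / total)

def rg_distance(rgb_a, rgb_b):
    """Euclidean distance between two RGBs in (r, g) chromaticity space."""
    ra, ga = rg_chromaticity(rgb_a)
    rb, gb = rg_chromaticity(rgb_b)
    return math.hypot(ra - rb, ga - gb)
```

Because chromaticity discards magnitude, an estimate that is merely brighter or darker than the true illuminant incurs no error under this measure.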

In Figure 3 we show the results for the Sony DXC-930 video camera. We 
see that when solution selection is done by averaging (AVE and ICA), the ND 
algorithms work distinctly better than their standard counterparts. On the other hand, 
when solutions are chosen by the maximum volume heuristic, the ND algorithms 
performed slightly worse than their standard counterparts, provided that the number of 
surfaces was not large. Interestingly, as the number of surfaces becomes large, the 
error in all the ND versions continues to drop to zero, whereas the error in the standard 
versions levels off well above zero. In [4, page 92] we postulated that this latter 
behavior was due to the limitations of the diagonal model, and the present results 
confirm this. 

In Figure 4 we show the results for the Kodak DCS-200 digital camera. The 
sensors of this camera do not support the diagonal model very well [6], and thus it is 
not surprising that the new extension significantly improves the performance of all 
four algorithms. 

Fig. 3. Algorithm chromaticity performance versus the number of surfaces in generated 
scenes, showing the main gamut mapping algorithms and their non-diagonal 
counterparts. These results are for the Sony DXC-930 video camera which has relatively 
sharp sensors (the diagonal model is a good approximation in general). The error in the 
plotted values is roughly 2%. 

Fig. 4. Algorithm chromaticity performance versus the number of surfaces in generated 
scenes, showing the main gamut mapping algorithms and their non-diagonal 
counterparts. These results are for the Kodak DCS-200 digital camera which has relatively 
dull sensors (the diagonal model is not very accurate). The error in the plotted values is 
roughly 2%. 

We now turn to results with generated data for the solution selection method 
developed above. For this experiment we included the ND extension to reduce the 
confound of diagonal model failure. We label the new method with the suffix SCWIA 
(Surface-Constrained-Weighted-Illuminant-Average) followed by the value of the 
parameter N in Equation (1). The results are shown in Figure 5. First we point out 
that solution selection by the original averaging method out-performs the maximum 
volume heuristic when the number of surfaces is small, but as the number of surfaces 
increases, the maximum volume heuristic quickly becomes the preferred method. 

Turning to the new method, we see that it indeed offers a compromise 
between these two existing methods, with the new method tending towards the 
maximum volume method as N increases. More importantly, as long as N is 6 or 
more, the new method invariably outperforms solution selection by averaging. 
Furthermore, for N in the range of 9-24, the performance of the new method is better 
than the maximum volume heuristic, except when the number of surfaces is 
unusually large. When the number of surfaces becomes large, the maximum volume 
heuristic eventually wins out. 

An important observation is that the results for N in the range of 9-24 are 
quite close, especially around 8 surfaces. This is fortuitous, as we have previously 
observed [4] that 8 synthetic surfaces is roughly comparable in difficulty to our image 
data. Thus we are most interested in improving performance in the range of 4-16 
surfaces, and we are encouraged that the results here are not overly sensitive to N, 
provided that it is roughly correct. Based on our results, N=12 appears to be a good 
compromise value for general purpose use. 

Next we present some numerical results in the case of the Sony camera 
which show the interactions of the two modifications. These are shown in Table 1. 
The main point illustrated in this table is that the slight disadvantage of the ND 
method, when used in conjunction with MV, does not carry over to the new solution 
selection method. To explain further, we note that the positive effect of reducing the 
diagonal model error can be undermined by the expansion of the canonical gamut, 
which represents an increase in the size of the feasible sets. The positive effect occurs 
because these sets are more appropriate, but, all things being equal, their larger size is 
an increase in ambiguity. Thus when the ND method is used in conjunction with a 
camera which supports the diagonal model, then, as the results here show, the method 
can lead to a decrease in performance. In our experiments on generated data, the 
negative effect is present in the case of MV, but in the case of averaging, the effect is 
always slightly positive. When ND is used in conjunction with the new solution 
method, the results are also minimally compromised by this negative effect. This is 
very promising, because, in general, the diagonal model will be less appropriate, and 
the method will go from having little negative impact, to having a substantial 
positive effect. This has already been shown in the case of the Kodak DCS-200 
camera, as well as when the number of surfaces is large. Increasing the number of 
surfaces does not, of course, reduce the efficacy of the diagonal model, but under these 
conditions, the diagonal model becomes a limiting factor. 

Fig. 5. Algorithm chromaticity performance versus the number of surfaces in generated 
scenes, showing the selected gamut mapping algorithms, including ones with the new 
solution selection method. These results are for the Sony DXC-930 video camera. The 
error in the plotted values is roughly 2%. 

Number of Surfaces         4        8        16

ECRULE-MV                0.064    0.044    0.032
ECRULE-ICA               0.058    0.050    0.044
ECRULE-SCWIA-3           0.057    0.051    0.045
ECRULE-SCWIA-6           0.054    0.043    0.036
ECRULE-SCWIA-9           0.054    0.041    0.032
ECRULE-SCWIA-12          0.055    0.040    0.031
ECRULE-SCWIA-18          0.057    0.040    0.030
ECRULE-SCWIA-24          0.058    0.041    0.029
ND-ECRULE-MV             0.065    0.047    0.033
ND-ECRULE-ICA            0.057    0.049    0.043
ND-ECRULE-SCWIA-3        0.060    0.054    0.048
ND-ECRULE-SCWIA-6        0.054    0.044    0.036
ND-ECRULE-SCWIA-9        0.054    0.041    0.031
ND-ECRULE-SCWIA-12       0.055    0.041    0.030
ND-ECRULE-SCWIA-18       0.057    0.041    0.029
ND-ECRULE-SCWIA-24       0.059    0.042    0.029

Table 1: Algorithm chromaticity performance for some of the algorithms developed 
here, together with the original methods, for generated scenes with 4, 8, and 16 
surfaces. The numbers are the RMS value of 1000 measurements. The error in the 
values is roughly 2%. 



Finally we turn to results with 321 carefully calibrated images. These images 
were of 30 scenes under 11 different illuminants (9 were culled due to problems). The 
images are described more fully in [4]. Figure 6 shows the 30 scenes used. We provide 
the results of some of the algorithms discussed above, as well as several comparison 
methods. We use NOTHING to indicate the result of no colour constancy processing, 
and AVE-ILLUM for guessing that the illuminant is the average of a normalized 
illuminant database. The method labeled MAX estimates the illuminant RGB by the 
maximum found in each channel. GW estimates the illuminant based on the image 
average on the assumption that the average is the response to a perfect grey. DB-GW 
is similar, except that the average is now assumed to be the response to grey as 
defined by the average of a reflectance database. CIP-ICA is essentially a chromaticity 
version of ECRULE-ICA described in [11]. The method labeled NEURAL-NET is 
another chromaticity oriented algorithm which uses a neural net to estimate the 
illuminant chromaticity [17-19]. C-by-C-MAP is the Colour by Correlation method 
using the maximum posterior estimator [20]. Finally, C-by-C-MMSE is Colour by 
Correlation using the minimum mean square error estimate. All these comparison 
methods are described in detail in [4]. 

Table 2 shows the results over the 321 test images. The results from the 
image data generally confirm those from the generated data in the case of the new 
selection method. On the other hand, the ND method improves matters significantly 
in only one case, has essentially no effect in several others, and when used in 
conjunction with the new selection method, it has a small negative effect. Since the 
camera used already supports the diagonal model well, these varied results are 
understandable. 





                      Solution Selection Method (If Applicable)

Algorithm        MV       AVE/ICA   SCWIA-6   SCWIA-9   SCWIA-12   SCWIA-15

CRULE            0.045    0.046     -         -         -          -
ECRULE           0.041    0.047     0.043     0.039     0.037      0.037
ND-CRULE         0.047    0.039     -         -         -          -
ND-ECRULE        0.042    0.048     0.045     0.041     0.040      0.040

NOTHING          0.125
AVE-ILLUM        0.094
GW               0.106
DB-GW            0.088
MAX              0.062
CIP-ICA          0.081
NEURAL-NET       0.069
C-by-C-MAP       0.072
C-by-C-MMSE      0.070

Table 2: The image data results of the new algorithms compared to related 
algorithms. The numbers presented here are the RMS value of the results for 321 
images. Assuming normal statistics, the error in these numbers is roughly 4%. 



6 Conclusion 

We have described two improvements to gamut mapping colour constancy. These 
improvements are important because earlier work has shown that this approach is 
already one of the most promising. For the first improvement we modified the 
canonical gamuts used by these algorithms to account for expected failures of the 
diagonal model. When used with a camera which does not support the diagonal model 
very well, the new method was clearly superior. When used with a camera with sharp 
sensors, the resulting method improved gamut mapping algorithms when the solution 
was chosen by averaging. When the maximum volume heuristic was used, there was a 
slight decrease in performance. This decrease was erased when the method was 
combined with the second improvement. Furthermore, we posit that any decreases in 
performance must be balanced against the increased stability of the new method as the 
number of surfaces becomes large. 

We are also encouraged by the results of the new method for choosing the 
solution. Our findings contribute to the understanding of the relative behavior of the 
two existing methods. Furthermore, the flexibility of the new method allows us to 
select a variant which works better than either of the two existing methods for the 
kind of input we are most interested in. 

Abstract. This paper presents a theoretically very simple yet efficient approach 
for gray scale and rotation invariant texture classification based on local binary 
patterns and nonparametric discrimination of sample and prototype distribu- 
tions. The proposed approach is very robust in terms of gray scale variations, 
since the operators are by definition invariant against any monotonic transforma- 
tion of the gray scale. Another advantage is computational simplicity, as the 
operators can be realized with a few operations in a small neighborhood and a 
lookup table. Excellent experimental results obtained in two true problems of 
rotation invariance, where the classifier is trained at one particular rotation angle 
and tested with samples from other rotation angles, demonstrate that good dis- 
crimination can be achieved with the statistics of simple rotation invariant local 
binary patterns. These operators characterize the spatial configuration of local 
image texture and the performance can be further improved by combining them 
with rotation invariant variance measures that characterize the contrast of local 
image texture. The joint distributions of these orthogonal measures are shown to 
be very powerful tools for rotation invariant texture analysis. 

1 Introduction 

Real world textures can occur at arbitrary rotations and they may be subjected to varying 
illumination conditions. This has inspired a few studies on gray scale and rotation 
invariant texture analysis, which presented methods for incorporating both types of 
invariance [2,14]. A larger number of papers have been published on plain rotation 
invariant analysis, among others [4,5,6,7,8,12], while [3] proposed an approach to 
encompass invariance with respect to another important property, spatial scale, in con- 
junction with rotation invariance. 

Both Chen and Kundu [2] and Wu and Wei [14] approached gray scale invariance 
by assuming that the gray scale transformation is a linear function. This is a somewhat 
strong simplification, which may limit the usefulness of the proposed methods. Chen 
and Kundu realized gray scale invariance by global normalization of the input image 
using histogram equalization. This is not a general solution, however, as global histogram 
equalization cannot correct intraimage (local) gray scale variations. Another 
problem of many approaches to rotation invariant texture analysis is their computa- 
tional complexity (e.g. [3]), which may render them impractical. 

In this study we propose a theoretically and computationally simple approach 
which is robust in terms of gray scale variations and which is shown to discriminate 
rotated textures efficiently. Extending our earlier work [9,10,11], we present a truly 
gray scale and rotation invariant texture operator based on local binary patterns. Start- 
ing from the joint distribution of gray values of a circularly symmetric neighbor set of 
eight pixels in a 3x3 neighborhood, we derive an operator that is by definition invariant 
against any monotonic transformation of the gray scale. Rotation invariance is 
achieved by recognizing that this gray scale invariant operator incorporates a fixed set 
of rotation invariant patterns. 

The novel contribution of this work is to use only a limited subset of ‘uniform’ patterns 
instead of all rotation invariant patterns, which improves the rotation invariance 
considerably. We call this operator $LBP_8^{riu2}$. The use of only ‘uniform’ patterns is 
motivated by the reasoning that they tolerate rotation better because they contain fewer 
spatial transitions exposed to unwanted changes upon rotation. This approximation is 
also supported by the fact that these ‘uniform’ patterns tend to dominate in deterministic 
textures, which is demonstrated using sample image data. Further, we propose 
an operator called $LBP_{16}^{riu2}$, which enhances the angular resolution of $LBP_8^{riu2}$ by 
considering a circularly symmetric set of 16 pixels in a 5x5 neighborhood. 

These operators are excellent measures of the spatial structure of local image tex- 
ture, but they by definition discard the other important property of local image texture, 
contrast, since it depends on the gray scale. We characterize contrast with rotation 
invariant variance measures named $VAR_8$ and $VAR_{16}$, corresponding to the circularly 
symmetric neighbor set where they are computed. We present the joint distributions of 
these complementary measures as powerful tools for rotation invariant texture classifi- 
cation. As the classification rule we employ nonparametric discrimination of sample 
and prototype distributions based on a log-likelihood measure of the (dis)similarity of 
histograms. 

The performance of the proposed approach is demonstrated with two problems 
used in recent studies on rotation invariant texture classification [4,12]. In addition to 
the original experimental setups we also consider more challenging cases, where the 
texture classifier is trained at one particular rotation angle and then tested with samples 
from other rotation angles. Excellent experimental results demonstrate that the texture 
representation obtained at a specific rotation angle generalizes to other rotation angles. 
The proposed operators are also computationally attractive, as they can be realized 
with a few operations in a small neighborhood and a lookup table. 

The paper is organized as follows. The derivation of the operators and the classifi- 
cation principle are described in Section 2. Experimental results are presented in Sec- 
tion 3 and Section 4 concludes the paper. 

2 Gray Scale and Rotation Invariant Local Binary Patterns 

We start the derivation of our gray scale and rotation invariant texture operator by 
defining texture T in a local 3x3 neighborhood of a monochrome texture image as the 
joint distribution of the gray levels of the nine image pixels: 

    $T = p(g_0, g_1, g_2, g_3, g_4, g_5, g_6, g_7, g_8)$    (1)

where $g_i$ (i = 0, ..., 8) correspond to the gray values of the pixels in the 3x3 neighborhood 
according to the spatial layout illustrated in Fig. 1. The gray values of diagonal pixels 
($g_2$, $g_4$, $g_6$, and $g_8$) are determined by interpolation. 

Fig. 1. The circularly symmetric neighbor set of eight pixels in a 3x3 neighborhood. 

2.1 Achieving Gray Scale Invariance 

As the first step towards gray scale invariance we subtract, without losing information, 
the gray value of the center pixel ($g_0$) from the gray values of the eight surrounding 
pixels of the circularly symmetric neighborhood ($g_i$, i = 1, ..., 8), giving: 

    $T = p(g_0, g_1 - g_0, g_2 - g_0, g_3 - g_0, g_4 - g_0, g_5 - g_0, g_6 - g_0, g_7 - g_0, g_8 - g_0)$    (2)

Next, we assume that the differences $g_i - g_0$ are independent of $g_0$, which allows us to 
factorize Eq.(2): 

    $T \approx p(g_0) \, p(g_1 - g_0, g_2 - g_0, g_3 - g_0, g_4 - g_0, g_5 - g_0, g_6 - g_0, g_7 - g_0, g_8 - g_0)$    (3)

In practice an exact independence is not warranted, hence the factorized distribu- 
tion is only an approximation of the joint distribution. However, we are willing to 
accept the possible small loss in information, as it allows us to achieve invariance with 
respect to shifts in gray scale. Namely, the distribution $p(g_0)$ in Eq.(3) describes the 
overall luminance of the image, which is unrelated to local image texture, and consequently 
does not provide useful information for texture analysis. Hence, much of the 
information in the original joint gray level distribution (Eq.(1)) about the textural 
characteristics is conveyed by the joint difference distribution [10]: 

    $T \approx p(g_1 - g_0, g_2 - g_0, g_3 - g_0, g_4 - g_0, g_5 - g_0, g_6 - g_0, g_7 - g_0, g_8 - g_0)$    (4)



Signed differences $g_i - g_0$ are not affected by changes in mean luminance, hence the 
joint difference distribution is invariant against gray scale shifts. We achieve invariance 
with respect to the scaling of the gray scale by considering just the signs of the differences 
instead of their exact values: 

    $T \approx p(s(g_1 - g_0), s(g_2 - g_0), s(g_3 - g_0), s(g_4 - g_0), ..., s(g_8 - g_0))$    (5)

where 

    $s(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases}$    (6)

If we formulate Eq.(5) slightly differently, we obtain an expression similar to the 
LBP (Local Binary Pattern) operator we proposed in [9]: 

    $LBP_8 = \sum_{i=1}^{8} s(g_i - g_0) \, 2^{i-1}$    (7)

The two differences between $LBP_8$ and the LBP operator of [9] are: 1) the pixels in 
the neighbor set are indexed so that they form a circular chain, and 2) the gray values 
of the diagonal pixels are determined by interpolation. Both modifications are necessary 
to obtain the circularly symmetric neighbor set, which allows for deriving a rotation 
invariant version of $LBP_8$. For notational reasons we augment LBP with subscript 
8 to denote that the $LBP_8$ operator is determined from the 8 pixels in a 3x3 neighborhood. 
The name ‘Local Binary Pattern’ reflects the nature of the operator, i.e. a local 
neighborhood is thresholded at the gray value of the center pixel into a binary pattern. 
The $LBP_8$ operator is by definition invariant against any monotonic transformation of the 
gray scale, i.e. as long as the order of the gray values stays the same, the output of the 
$LBP_8$ operator remains constant. 
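
Eq.(7) can be sketched in a few lines. The sketch assumes the eight neighbor gray values have already been sampled (with the diagonal ones interpolated) and are passed in circular order; the function names are illustrative.

```python
def sign(x):
    """s(x) of Eq.(6): 1 when x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0

def lbp8(center, neighbors):
    """Eq.(7): neighbors holds g_1..g_8 in circular order, center is g_0.

    Neighbor g_i contributes the bit of weight 2^(i-1).
    """
    if len(neighbors) != 8:
        raise ValueError("expected eight neighbor gray values")
    return sum(sign(g - center) << i for i, g in enumerate(neighbors))

# Illustrative gray values; any monotonic increasing transformation of
# the gray scale leaves the code unchanged.
neighbors = [6, 7, 2, 1, 5, 9, 3, 4]
code = lbp8(5, neighbors)
code_transformed = lbp8(5 ** 3 + 7, [g ** 3 + 7 for g in neighbors])
```

Both calls yield the same code, which illustrates the invariance against monotonic gray scale transformations.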

2.2 Achieving Rotation Invariance 

The $LBP_8$ operator produces 256 ($2^8$) different output values, corresponding to the 256 
different binary patterns that can be formed by the eight pixels in the neighbor set. 
When the image is rotated, the gray values $g_i$ will correspondingly move along the 
perimeter of the circle around $g_0$. Since we always assign $g_1$ to be the gray value of 
element (0,1), to the right of $g_0$, rotating a particular binary pattern naturally results in 
a different $LBP_8$ value. This does not apply to patterns $00000000_2$ and $11111111_2$, 
which remain constant at all rotation angles. To remove the effect of rotation, i.e. to 
assign a unique identifier to each rotation invariant local binary pattern, we define: 

    $LBP_8^{ri36} = \min\{ ROR(LBP_8, i) \mid i = 0, 1, ..., 7 \}$    (8)

where ROR(x,i) performs a circular bit-wise right shift on the 8-bit number x i times. In 
terms of image pixels Eq.(8) simply corresponds to rotating the neighbor set clockwise 
so many times that a maximal number of the most significant bits, starting from $g_8$, are 
0. We observe that $LBP_8^{ri36}$ can have 36 different values, corresponding to the 36 
unique rotation invariant local binary patterns illustrated in Fig. 2, hence the superscript 
ri36. $LBP_8^{ri36}$ quantifies the occurrence statistics of these patterns corresponding 
to certain microfeatures in the image, hence the patterns can be considered as feature 
detectors. For example, pattern #0 detects bright spots, #8 dark spots and flat areas, and 
#4 edges. Hence, we have obtained the gray scale and rotation invariant operator 
$LBP_8^{ri36}$ that we designated as LBPROT in [11]. 
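
Eq.(8) can be sketched directly from its definition. The function names are illustrative; enumerating all 256 codes confirms the count of 36 rotation invariant patterns.

```python
def ror8(x, i):
    """Circular bit-wise right shift of the 8-bit number x by i positions."""
    i %= 8
    return ((x >> i) | (x << (8 - i))) & 0xFF

def lbp8_ri36(code):
    """Eq.(8): the minimum over all eight circular shifts of the code."""
    return min(ror8(code, i) for i in range(8))

# Collect the distinct rotation invariant identifiers over all 8-bit codes.
unique_codes = sorted({lbp8_ri36(c) for c in range(256)})
```

Every rotated version of a pattern maps to the same identifier, so a histogram over these identifiers does not depend on which neighbor is labeled first.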



Fig. 2. The 36 unique rotation invariant binary patterns that can occur in the eight pixel 
circularly symmetric neighbor set. Black and white circles correspond to bit values of 0 and 1 in the 
8-bit output of the $LBP_8$ operator. The first row contains the nine ‘uniform’ patterns, and the 
numbers inside them correspond to their unique $LBP_8^{riu2}$ values. 



2.3 Improved Rotation Invariance with ‘Uniform’ Patterns 

However, our practical experience has shown that $LBP_8^{ri36}$ as such does not provide a 
very good discrimination, as we also concluded in [11]. There are two reasons: 

1) the performance of the 36 individual patterns in discrimination of rotated tex- 
tures varies greatly: while some patterns sustain rotation quite well, other patterns do 
not and only confuse the analysis. Consequently, using all 36 patterns leads to a subop- 
timal result (addressed in this section). 

2) crude quantization of the angular space at 45° intervals (addressed in Section 
2.4). 

The varying performance of individual patterns is attributable to the spatial structure of 
the patterns. To quantify this we define a uniformity measure U(‘pattern’), which corresponds 
to the number of spatial transitions (bitwise 0/1 changes) in the ‘pattern’. For 
example, patterns $00000000_2$ and $11111111_2$ have U value of 0, while the other seven 
patterns in the first row of Fig. 2 have U value of 2, as there are exactly two 0/1 transitions 
in the pattern. Similarly, the other 27 patterns have U value of at least 4. 

We argue that the larger the uniformity value U of a pattern is, i.e. the larger the number 
of spatial transitions that occur in the pattern, the more likely the pattern is to change 
to a different pattern upon rotation in the digital domain. Based on this argument we designate 
patterns that have U value of at most 2 as ‘uniform’ and propose the following 
operator for gray scale and rotation invariant texture description instead of $LBP_8^{ri36}$: 

    $LBP_8^{riu2} = \begin{cases} \sum_{i=1}^{8} s(g_i - g_0), & \text{if } U(LBP_8) \leq 2 \\ 9, & \text{otherwise} \end{cases}$    (9)

Eq.(9) corresponds to giving a unique label to the nine ‘uniform’ patterns illustrated 
in the first row of Fig. 2 (the label corresponds to the number of ‘1’ bits in the pattern), 
the 27 other patterns being grouped under the ‘miscellaneous’ label (9). 
Superscript riu2 corresponds to the use of rotation invariant ‘uniform’ patterns that 
have U value of at most 2. 

The selection of ‘uniform’ patterns with the simultaneous compression of ‘nonuniform’ 
patterns is also supported by the fact that the former tend to dominate in deterministic 
textures. This is studied in more detail in Section 3 using the image data of the 
experiments. In practice the mapping from $LBP_8$ to $LBP_8^{riu2}$, which has 10 distinct 
output values, is best implemented with a lookup table of 256 elements. 
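
The lookup table can be built as in the following sketch. It assumes that U counts transitions circularly, i.e. including the wrap-around between the last and the first bit, which is what gives a pattern such as $00001111_2$ a U value of 2; the function names are illustrative.

```python
def uniformity(code, bits=8):
    """U value: number of circular 0/1 transitions in the bit pattern."""
    rotated = (code >> 1) | ((code & 1) << (bits - 1))  # circular shift by one
    return bin(code ^ rotated).count("1")

def build_riu2_table(bits=8):
    """Map every code to its riu2 label as in Eq.(9): the number of 1 bits
    for 'uniform' patterns, and bits + 1 for all other patterns."""
    table = []
    for code in range(1 << bits):
        if uniformity(code, bits) <= 2:
            table.append(bin(code).count("1"))
        else:
            table.append(bits + 1)
    return table

RIU2_TABLE_8 = build_riu2_table(8)
```

Classifying a pixel then reduces to computing its 8-bit code and indexing the table, which is what makes the operator computationally cheap.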

2.4 Improved Angular Resolution with a 16 Pixel Neighborhood 

We noted earlier that the rotation invariance of $LBP_8^{riu2}$ is hampered by the crude 45° 
quantization of the angular space provided by the neighbor set of eight pixels. To 
address this we present a modification, where the angular space is quantized at a finer 
resolution of 22.5° intervals. This is accomplished with the circularly symmetric 
neighbor set of 16 pixels illustrated in Fig. 3. Again, the gray values of neighbors 
which do not fall exactly in the center of pixels are estimated by interpolation. Note 
that we increase the size of the local neighborhood to 5x5 pixels, as the eight added 
neighbors would not provide too much new information if inserted into the 3x3 neigh- 
borhood. An additional advantage is the different spatial resolution, if we should want 
to perform multiresolution analysis. 

Fig. 3. The circularly symmetric neighbor set of 16 pixels in a 5x5 neighborhood. 

Following the derivation of LBP 3 , we first define the 16-bit version of the rotation 
variant LBP; 
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16 

LSPjg = s(,h.-h^)2‘-^ (10) 

i - 1 
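A minimal sketch of how the 16-bit code of Eq. (10) could be computed for one pixel. Several details are assumptions made for illustration and are not stated in this section: a radius of 2 pixels (which fits the 5x5 neighborhood), neighbors ordered counter-clockwise starting from the right-hand neighbor, bilinear interpolation for off-grid neighbors, and s(x) = 1 for x >= 0 and 0 otherwise. The function names are hypothetical.

```python
import math


def bilinear(img, y, x):
    """Bilinearly interpolate img (a list of rows) at real-valued (y, x)."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1 = min(y0 + 1, len(img) - 1)
    x1 = min(x0 + 1, len(img[0]) - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * img[y0][x0] + dx * img[y0][x1]
    bottom = (1 - dx) * img[y1][x0] + dx * img[y1][x1]
    return (1 - dy) * top + dy * bottom


def lbp16(img, cy, cx, radius=2.0, points=16):
    """Raw 16-bit LBP code of the pixel at (cy, cx); assumes the ring fits in img."""
    center = img[cy][cx]
    code = 0
    for i in range(points):
        angle = 2.0 * math.pi * i / points        # 22.5 degree steps
        # Rounding removes floating point noise such as 1.9999999999999996.
        y = round(cy - radius * math.sin(angle), 9)
        x = round(cx + radius * math.cos(angle), 9)
        if bilinear(img, y, x) - center >= 0:      # s(h_i - h_0)
            code |= 1 << i                         # weight 2^(i-1), with i starting at 1
    return code
```

The resulting code would then be mapped to its rotation invariant label through a lookup table, in the same way as for the 8-bit operator.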



The LBP_16 operator has 65536 (2^16) different output values and 243 different rotation
invariant patterns can occur in the circularly symmetric set of 16 pixels. Choosing
again the ‘uniform’ rotation invariant patterns that have at most two 0/1 transitions, we
define the 16-bit version of LBP_8^{riu2}:

$$
LBP_{16}^{riu2} =
\begin{cases}
\sum_{i=1}^{16} s(h_i - h_0) & \text{if } U(LBP_{16}) \le 2 \\
17 & \text{otherwise}
\end{cases}
\qquad (11)
$$

Thus, the LBP_16^{riu2} operator has 18 distinct output values, of which values from 0
(pattern 0000000000000000_2) to 16 (pattern 1111111111111111_2) correspond to the
number of 1 bits in the 17 unique ‘uniform’ rotation invariant patterns, and value 17
denotes the ‘miscellaneous’ class of all ‘nonuniform’ patterns. In practice the mapping
from LBP_16 to LBP_16^{riu2} is implemented with a lookup table of 2^16 elements.

2.5 Rotation Invariant Variance Measures of the Contrast of Local Image Texture 

Generally, image texture is regarded as a two dimensional phenomenon that can be 
characterized with two orthogonal properties, spatial structure (pattern) and contrast 
(the ‘amount’ of local image texture). In terms of gray scale and rotation invariant tex- 
ture description these two are an interesting pair: whereas spatial pattern is affected by 
rotation, contrast is not, and vice versa, whereas contrast is affected by the gray scale, 
spatial pattern is not. Consequently, as long as we want to restrict ourselves to pure 
gray scale invariant texture analysis, contrast is of no interest, as it depends on the gray 
scale. 

The LBP_8^{riu2} and LBP_16^{riu2} operators are true gray scale invariant measures, i.e.
their output is not affected by any monotonic transformation of the gray scale. They
are excellent measures of the spatial pattern, but by definition discard contrast. If,
under stable lighting conditions, we want to incorporate the contrast of local image
texture as well, we can measure it with rotation invariant measures of local variance:



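$$
VAR_8 = \frac{1}{8} \sum_{i=1}^{8} (g_i - \mu_8)^2 ,
\quad \text{where } \mu_8 = \frac{1}{8} \sum_{i=1}^{8} g_i
\qquad (12)
$$

$$
VAR_{16} = \frac{1}{16} \sum_{i=1}^{16} (h_i - \mu_{16})^2 ,
\quad \text{where } \mu_{16} = \frac{1}{16} \sum_{i=1}^{16} h_i
\qquad (13)
$$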
VAR_8 and VAR_16 are by definition invariant against shifts in gray scale. Since LBP
and VAR are complementary, their joint distributions LBP_8^{riu2}/VAR_8 and
LBP_16^{riu2}/VAR_16 are very powerful rotation invariant measures of local image texture.
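The shift invariance of the variance measures can be illustrated with a short sketch. The function name and the sample gray values are hypothetical. The example also shows that scaling the gray values, a monotonic but non-shift transformation, does change the measure, which is why contrast is only usable under stable lighting.

```python
def local_variance(neighbors):
    """VAR_P = (1/P) * sum_i (g_i - mu)^2, with mu the mean of the P neighbors."""
    p = len(neighbors)
    mu = sum(neighbors) / p
    return sum((g - mu) ** 2 for g in neighbors) / p


ring = [12, 15, 40, 42, 41, 14, 11, 10]   # made-up gray values of 8 neighbors
shifted = [g + 50 for g in ring]           # constant shift of the gray scale
scaled = [3 * g for g in ring]             # contrast stretch

var_ring = local_variance(ring)
var_shifted = local_variance(shifted)      # same value as var_ring
var_scaled = local_variance(scaled)        # nine times var_ring
```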

2.6 Nonparametric Classification Principle 

In the classification phase a test sample S was assigned to the class of the model M that 
maximized the log-likelihood measure: 

$$
L(S, M) = \sum_{b=1}^{B} S_b \log M_b \qquad (14)
$$

where B is the number of bins, and S_b and M_b correspond to the sample and model
probabilities at bin b, respectively. This nonparametric (pseudo-)metric measures
likelihoods that samples are from alternative texture classes, based on exact probabilities
of feature values of pre-classified texture prototypes. In the case of the joint
distributions LBP_8^{riu2}/VAR_8 and LBP_16^{riu2}/VAR_16, the log-likelihood measure
(Eq. (14)) was extended in a straightforward manner to scan through the two-dimensional
histograms.
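The classification rule of Eq. (14) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, histograms are plain lists of bin counts, and the small floor `eps` that guards against log(0) for empty model bins is an addition of this sketch. A two-dimensional joint histogram can be handled by flattening it into one list.

```python
import math


def normalize(hist):
    """Turn bin counts into probabilities."""
    total = float(sum(hist))
    return [h / total for h in hist]


def log_likelihood(sample, model, eps=1e-12):
    """L(S, M) = sum_b S_b * log(M_b); eps is an assumed guard for empty bins."""
    return sum(s * math.log(max(m, eps)) for s, m in zip(sample, model))


def classify(sample, models):
    """Return the label of the model that maximizes L(S, M)."""
    return max(models, key=lambda label: log_likelihood(sample, models[label]))


models = {
    "class_a": normalize([8, 1, 1]),
    "class_b": normalize([1, 1, 8]),
}
sample = normalize([7, 2, 1])
predicted = classify(sample, models)
```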

Sample and model distributions were obtained by scanning the texture samples and
prototypes with the chosen operator, and dividing the distributions of operator outputs
into histograms having a fixed number of B bins. Since LBP_8^{riu2} and LBP_16^{riu2} have a
completely defined set of discrete output values, they do not require any additional
binning procedure, but the operator outputs are directly accumulated into a histogram of
10 (LBP_8^{riu2}) or 18 (LBP_16^{riu2}) bins.

Variance measures VAR_8 and VAR_16 have a continuous-valued output, hence quantization
of their feature space is required. This was done by adding together feature
distributions for every single model image in a total distribution, which was divided 
into B bins having an equal number of entries. Hence, the cut values of the bins of the 
histograms corresponded to the (100/B) percentile of the combined data. Deriving the 
cut values from the total distribution and allocating every bin the same amount of the 
combined data guarantees that the highest resolution of quantization is used where the 
number of entries is largest and vice versa. The number of bins used in the quantiza- 
tion of the feature space is of some importance, as histograms with a too modest num- 
ber of bins fail to provide enough discriminative information about the distributions. 
On the other hand, since the distributions have a finite number of entries, a too large 
number of bins may lead to sparse and unstable histograms. As a rule of thumb, statis- 
tics literature often proposes that an average number of 10 entries per bin should be 
sufficient. In the experiments we set the value of B so that this condition was satisfied. 
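The equal-frequency quantization described above can be sketched as follows. The function names are hypothetical, and picking each cut value directly from the sorted pooled data is one simple way to realize the percentile rule, not necessarily the one used in the experiments.

```python
import bisect


def equal_frequency_cuts(values, n_bins):
    """Cut values such that each of n_bins bins receives about the same number of entries."""
    ordered = sorted(values)
    n = len(ordered)
    # The k-th cut sits at the (100 * k / n_bins) percentile of the pooled data.
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]


def bin_index(value, cuts):
    """Index of the bin that a feature value falls into."""
    return bisect.bisect_right(cuts, value)


pooled = list(range(100))                  # stand-in for pooled VAR outputs
cuts = equal_frequency_cuts(pooled, 4)
counts = [0] * 4
for v in pooled:
    counts[bin_index(v, cuts)] += 1
```

With heavily tied data the bins can no longer be exactly equal in size, so the bin count B still has to be chosen with the number of available entries in mind.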

3 Experiments 

We demonstrate the performance of our operators with two different sets of texture image
data that have been used in recent studies on rotation invariant texture classification
[4,12]. In both cases we first replicate the original experimental setup as carefully as
possible, to get comparable results. Since the training data included samples from several
rotation angles, we also present results for a more challenging setup, where the
samples of just one particular rotation angle are used for training the texture classifier,
which is then tested with the samples of the other rotation angles.

However, we first report classification results for the problem that we used in our
recent study on rotation invariant texture analysis [11]. There we achieved an error rate
of 39.2% with the LBP_8^{ri36} (LBPROT) operator when using 64x64 samples, while the
LBP_8^{riu2} and LBP_16^{riu2} operators provide error rates of 25.5% and 8.0%, respectively.
These improvements underline the benefits of using ‘uniform’ patterns and finer
quantization of the angular space.

Before going into the experiments we use the image data to take a quick look at the
statistical foundation of LBP_8^{riu2} and LBP_16^{riu2}. In the case of LBP_8^{riu2} we choose
nine ‘uniform’ patterns out of the 36 possible patterns, merging the remaining 27
under the ‘miscellaneous’ label. Similarly, in the case of LBP_16^{riu2} we consider only
7% (17 out of 243) of the possible rotation invariant patterns. Taking into account a
minority of the possible patterns, and merging a majority of them, could imply that we
are throwing away most of the pattern information. However, this is not the case, as the
‘uniform’ patterns tend to be the dominant structure.

For example, in the case of the image data of Experiment #2, the nine ‘uniform’
patterns of LBP_8^{riu2} contribute from 88% up to 94% of the total pattern data, averaging
90.9%. The most frequent individual pattern is the symmetric edge detector 00001111_2
with about 25% share, followed by 00000111_2 and 00011111_2 with about 15% each.
As expected, in the case of LBP_16^{riu2} the 17 ‘uniform’ patterns contribute a smaller
proportion of the image data, from 70% up to 84% of the total pattern data, averaging
76.3%. The most frequent pattern is again the symmetric edge detector
0000000011111111_2 with about 9.3% share.

3.1 Experiment #1 

In their comprehensive study Porter and Canagarajah [12] presented three feature 
extraction schemes for rotation invariant texture classification, employing the wavelet 
transform, a circularly symmetric Gabor filter and a Gaussian Markov Random Field 
with a circularly symmetric neighbor set. They concluded that the wavelet-based 
approach was the most accurate and exhibited the best noise performance, having also 
the lowest computational complexity. 

Image Data and Experimental Setup. Image data included 16 texture classes from 
the Brodatz album [1] shown in Fig. 4. For each texture class there were eight 256x256 
images, of which the first was used for training the classifier, while the other seven 
images were used to test the classifier. Rotated textures were created from these source 
images using a proprietary interpolation program that produced images of 180x180 
pixels in size. If the rotation angle was a multiple of 90 degrees (0° or 90° among the
present ten rotation angles), a small amount of artificial blur was added to the original
images to simulate the effect of blurring on rotation at other angles.
In the original experimental setup the texture classifier was trained with several
16x16 subimages extracted from the training image. This fairly small size of training
samples increases the difficulty of the problem nicely. The training set comprised
rotation angles 0°, 30°, 45°, and 60°, while the textures for classification were presented at
rotation angles 20°, 70°, 90°, 120°, 135°, and 150°. Consequently, the test data
included 672 samples, 42 (6 angles x 7 images) for each of the 16 texture classes.
Using a Mahalanobis distance classifier, Porter and Canagarajah reported 95.8% accuracy
for the rotation invariant wavelet-based features as the best result.





[Figure panels: MATTING 70°, PIGSKIN 120°, RATTAN 150°, STRAW 30°, WEAVE 45°, WOOD 60°, WOOL 70°]

Fig. 4. Texture images of Experiment #1 printed at particular orientations. Textures were
presented at ten different angles: 0°, 20°, 30°, 45°, 60°, 70°, 90°, 120°, 135°, and 150°.
Images are 180x180 pixels in size.



Experimental Results. We started replicating the original experimental setup by
dividing the 180x180 images of the four training angles (0°, 30°, 45°, and 60°) into
121 disjoint 16x16 subimages. In other words we had 7744 training samples, 484 (4
angles x 121 samples) in each of the 16 texture classes. We first computed the histogram
of the chosen operator for each of the 16x16 samples. We then added the histograms
of all samples belonging to a particular class into one big model histogram for
this class, since the histograms of single 16x16 samples would be too sparse to be reliable
models. Also, using 7744 different models would result in computational overhead,
for in the classification phase the sample histograms are compared to every
model histogram. Consequently, we obtained 16 reliable model histograms containing
108900 (LBP_8^{riu2} and VAR_8 with a 1 pixel border produce 15^2 entries for a 16x16
sample) or 94864 (LBP_16^{riu2} and VAR_16 have a 2 pixel border) entries.

The performance of the operators was evaluated with the 672 testing images. The
sample histogram contained 32041/31684 entries, hence we did not have to worry
about their stability. Classification results (the percentage of misclassified samples
from all classified samples) for the four individual operators and the two joint
distributions are given in Table 1.
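The sample and entry counts quoted above can be checked with a few lines of arithmetic. The helper names are hypothetical, and the rule that an operator with a border of b pixels yields (N - b)^2 outputs for an NxN image is an assumption inferred from the reported figures rather than a definition given in the text.

```python
def tiles_per_image(image_size, tile_size):
    """Number of disjoint square tiles that fit into a square image."""
    return (image_size // tile_size) ** 2


def histogram_entries(size, border):
    """Operator outputs for a size x size image (assumed rule: (size - border)^2)."""
    return (size - border) ** 2


tiles = tiles_per_image(180, 16)                           # 121 subimages per image
per_class = 4 * tiles                                      # 484 training samples per class
training_samples = 16 * per_class                          # 7744 in total
model_entries_8 = per_class * histogram_entries(16, 1)     # 108900
model_entries_16 = per_class * histogram_entries(16, 2)    # 94864
test_entries_8 = histogram_entries(180, 1)                 # 32041
test_entries_16 = histogram_entries(180, 2)                # 31684
test_samples = 16 * 6 * 7                                  # 672
```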



Table 1: Error rates (%) for the original experimental setup, where training is done
with rotations 0°, 30°, 45°, and 60°.

OPERATOR                BINS     ERROR
LBP_8^{riu2}            10       11.76
LBP_16^{riu2}           18        1.49
VAR_8                   128       4.46
VAR_16                  128      11.61
LBP_8^{riu2}/VAR_8      10/16     1.64
LBP_16^{riu2}/VAR_16    18/16     0.15



As expected, LBP_16^{riu2} clearly outperforms its 8-bit version LBP_8^{riu2}. LBP_8^{riu2} has
difficulties in discriminating strongly oriented textures straw (66.7% error, 28 samples
misclassified as grass), rattan (64.3%, 27 samples misclassified as wood) and wood
(33.3% error, 14 samples misclassified as rattan), which contribute 69 of the 79
misclassified samples. Interestingly, in all 79 cases the model of the true class ranks
second right after the nearest model of a false class that leads to misclassification. The
distribution of rotation angles among the misclassified samples is surprisingly even, as
all six testing angles contribute from 10 to 16 misclassified samples (16, 16, 14, 10, 13,
10). LBP_16^{riu2} does much better, classifying all samples correctly except ten grass
samples that are assigned to leather. Again, in all ten cases the model of the true class
grass ranks second.

We see that combining the LBP operators with the VAR measures, which do not do
too badly by themselves, improves the performance considerably. In the case of
LBP_8^{riu2}/VAR_8 the 1.64% error is caused by 11 straw samples erroneously assigned to
class grass. In ten of the 11 misclassifications the model of the straw class ranks
second, once third. LBP_16^{riu2}/VAR_16 falls one sample short of a faultless result, as a
straw sample at 90° angle is labeled as grass.
Note that we voluntarily discarded the knowledge that training samples come from
four different rotation angles, merging all sample histograms into a single model for
each texture class. Hence the final texture model is an ‘average’ of the models of the
four training angles, which actually decreases the performance to a certain extent. If
we had used four separate models, one for each training angle, then for example
LBP_16^{riu2}/VAR_16 would have provided a perfect classification result, and the error
rate of LBP_16^{riu2} would have decreased by 50% to 0.74%.

Even though a direct comparison to the results of Porter and Canagarajah may not
be meaningful due to the different classification principle, the excellent results for
LBP_16^{riu2} and LBP_16^{riu2}/VAR_16 demonstrate their suitability for rotation invariant
texture classification.

Table 2 presents results for a more challenging experimental setup, where the classifier
is trained with samples of just one rotation angle and tested with samples of the other
nine rotation angles. We trained the classifier with the 121 16x16 samples extracted
from the designated training image, again merging the histograms of the 16x16 samples
of a particular texture class into one model histogram. The classifier was tested
with the samples obtained from the other nine rotation angles of the seven source
images reserved for testing purposes, totaling 1008 samples, 63 in each of the 16 texture
classes. Note that in each texture class the seven testing images are physically different
from the one designated training image, hence this setup is a true test for the
texture operators’ ability to produce a rotation invariant representation of local image
texture that also generalizes to physically different samples.



Table 2: Error rates (%) when training is done at just one rotation angle, and the
average error rate over the ten angles.

                                                      TRAINING ANGLE
OPERATOR                BINS     0°    20°   30°   45°   60°   70°   90°   120°  135°  150°  AVERAGE
LBP_8^{riu2}            10      31.5  13.7  15.3  23.7  15.1  15.6  30.6  15.8  23.7  15.1   20.00
LBP_16^{riu2}           18       3.8   1.0   1.4   0.9   1.6   0.9   2.4   1.4   1.2   2.3    1.68
VAR_8                   128      7.5   3.4   5.4   6.0   4.4   3.1   6.1   5.8   5.4   4.4    5.12
VAR_16                  128     10.1  15.5  13.8   9.5  12.7  14.3   9.0  10.4   9.3  11.5   11.62
LBP_8^{riu2}/VAR_8      10/16    0.9   5.8   4.3   2.7   4.8   5.6   0.7   4.0   2.7   4.4    3.56
LBP_16^{riu2}/VAR_16    18/16    0.0   0.5   0.6   0.6   0.6   0.4   0.0   0.5   0.5   0.3    0.40



Training with just one rotation angle allows a more conclusive analysis of the rotation
invariance of our operators. For example, it is hardly surprising that LBP_8^{riu2} provides
the highest error rates when the training angle is a multiple of 45°. Due to the crude
quantization of the angular space, the representations learned at 0°, 45°, 90°, or 135° do
not generalize that well to other angles.
LBP_16^{riu2} provides a solid performance with an average error rate of 1.68%. If we
look at the ranks of the true class in the 169 misclassifications, we see that in every
case the model of the true class ranks second. There is a strong suspicion that the subpar
results for training angles 0° and 90° are due to the artificial blur added to the original
images at angles 0° and 90°. The effect of the blur can also be seen in the results of
the joint distributions LBP_8^{riu2}/VAR_8 and LBP_16^{riu2}/VAR_16, which achieve their best
performance when the training angle is either 0° or 90°, the 16-bit operator pair providing
in fact a perfect classification in these cases. Namely, when training is done with some
other rotation angle, test angles 0° and 90° contribute most of the misclassified samples,
actually all of them in the case of LBP_16^{riu2}/VAR_16. Nevertheless, the results for
LBP_16^{riu2} and LBP_16^{riu2}/VAR_16 are excellent.

3.2 Experiment #2 

Haley and Manjunath [4] proposed a method based on a complete space-frequency 
model for rotation-invariant texture classification. They developed a polar analytic 
form of a two-dimensional Gabor wavelet, and used a multiresolution family of these 
wavelets to compute texture microfeatures. Rotation invariance was achieved by trans- 
forming Gabor features into rotation invariant features using autocorrelation and DFT 
magnitudes and by utilizing rotation invariant statistics of rotation dependent features. 
Classification results were presented for two groups of textures, of which we use the 
set of textures available in the WWW [13]. 

Image Data and Experimental Setup. The image data comprised the 13 textures
from the Brodatz album shown in Fig. 5. For each texture 512x512 images digitized at 
six different rotation angles (0°, 30°, 60°, 90°, 120°, and 150°) were included. The 
images were divided into 16 disjoint 128x128 subimages, totaling 1248 samples, 96 in 
each of the 13 classes. Half of the subimages, separated in a checkerboard pattern, 
were used to estimate the model parameters, while the other half was used for testing. 
Using a multivariate Gaussian discriminant, Haley and Manjunath reported 96.8% 
classification accuracy. 

Experimental Results. We first replicated the original experiment by computing the 
histograms of the training half of the 128x128 samples, which served as our model his- 
tograms. Since a 128x128 sample produces a sufficient number of entries (16129/ 
15876) for its histogram to be stable, we did not combine individual histograms. Con- 
sequently, we had 624 model histograms in total, 48 (6 angles x 8 images) models for 
each of the 13 texture classes. 

Since the training data includes all rotation angles, this problem is not particularly 
interesting in terms of rotation invariant texture classification and we restrict ourselves 
to merely reporting the error rates in Table 3. Both halves of the mosaic partitioning 
served as the training data in turn, the other being used as test samples, and as the final 
result we provide the average of the error rates of these two cases. We see that the 
results obtained with the joint distributions compare favorably to the 3.2% error rate 
reported by Haley and Manjunath. 








[Figure panels: BUBBLES 60°, BARK 0°, GRASS 90°, PIGSKIN 150°, RAFFIA 0°, SAND 30°, BRICK 30°, WATER 90°, WEAVE 120°, WOOD 150°, WOOL 0°]

Fig. 5. Texture images of Experiment #2 printed at particular rotation angles. Each texture was
digitized at six angles: 0°, 30°, 60°, 90°, 120°, and 150°. Images are 512x512 pixels in size.



Table 3: Error rates (%) for the original experimental setup, where training data
includes all rotation angles.

OPERATOR                BINS     ERROR
LBP_8^{riu2}/VAR_8      10/16    0.40 (0.16 & 0.64)
LBP_16^{riu2}/VAR_16    18/16    0.48 (0.16 & 0.64)
Again, we constructed a true test of rotation invariant texture classification, where
the classifier is trained with the samples of just one rotation angle and tested with the
samples of the other five rotation angles. We trained the classifier with the 128x128
samples extracted from the 512x512 images of a particular rotation angle, obtaining 208
models in total, 16 for each of the 13 texture classes. The classifier was then evaluated
with the 128x128 samples extracted from the 512x512 images of the other five rotation
angles, totaling 1040 test samples.



Table 4: Error rates (%) when training is done at just one rotation angle and the
average error rate over the six rotation angles.

                                          TRAINING ANGLE
OPERATOR                BINS     0°    30°   60°   90°   120°  150°  AVERAGE
LBP_8^{riu2}            10      20.2  13.7  13.7  17.7  17.0   8.8   15.18
LBP_16^{riu2}           18      10.4   8.2   8.5   8.6   8.3   6.9    8.46
VAR_8                   128      7.9   6.2   4.2   5.2   3.9   3.5    5.14
VAR_16                  128      7.6   6.3   4.6   3.8   3.7   4.7    5.11
LBP_8^{riu2}/VAR_8      10/16    2.1   2.5   0.8   0.5   1.2   0.5    1.25
LBP_16^{riu2}/VAR_16    18/16    1.9   1.0   0.5   0.3   0.2   0.3    0.69



From the error rates in Table 4 we observe that using just one rotation angle for
training indeed increases the difficulty of the problem considerably. If we take a closer
look at the confusion matrices of LBP_16^{riu2} (8.4% average error rate), we see that
about half (246/528) of the misclassifications are due to the samples of the strongly
oriented texture wood being erroneously assigned to straw. The training angle does not
seem to affect the classification accuracy too much, as roughly an equal result is
obtained in all six cases.

The complementary nature of the LBP and VAR operators shows in the excellent
results for their joint distributions. LBP_16^{riu2}/VAR_16 achieves a very low average error
rate of 0.69%, which corresponds to just about 7 misclassifications out of 1040 samples.
Of the 43 misclassifications in total, false assignments of wool samples to pigskin
contribute 16 and of grass samples to leather 11. It is worth noting that the performance
is not sensitive to the quantization of the VAR feature space, as the following average
error rates are obtained by LBP_16^{riu2}/VAR_16 with different numbers of bins: 1.31%
(18/2), 0.71% (18/4), 0.64% (18/8), 0.69% (18/16), 0.71% (18/32), and 0.74% (18/64).

4 Discussion 

We presented a theoretically and computationally simple but efficient approach for 
gray scale and rotation invariant texture classification based on local binary patterns 
and nonparametric discrimination of sample and prototype distributions. Excellent 
experimental results obtained in two problems of true rotation invariance, where the
classifier was trained at one particular rotation angle and tested with samples from 
other rotation angles, demonstrate that good discrimination can be achieved with the 
occurrence statistics of simple rotation invariant local binary patterns. The proposed 
approach is very robust in terms of gray scale variations, since the operators are by def- 
inition invariant against any monotonic transformation of the gray scale. This should 
make our operators very attractive in situations where varying illumination conditions 
are a concern, e.g. in visual inspection. Computational simplicity is another advantage, 
as the operators can be realized with a few comparisons in a small neighborhood and a 
lookup table. This facilitates a very straightforward and efficient implementation, 
which may be mandatory in time critical applications. If the stability of the gray scale 
is not something to be worried about, performance can be further improved by combining
the LBP_8^{riu2} and LBP_16^{riu2} operators with rotation invariant variance measures
VAR_8 and VAR_16 that characterize the contrast of local image texture. As we observed
in the experiments, the joint distributions of these orthogonal operators are very pow- 
erful tools for rotation invariant texture analysis. 

Regarding future work, in this study we reported results for two rotation invariant
LBP operators having different spatial configurations of the circularly symmetric neighbor
set, which determines the angular resolution. As expected, LBP_16^{riu2} with its more
precise quantization of the angular space provides clearly better classification accuracy.
Nothing prevents us from using even larger circularly symmetric neighbor sets,
say 24 or 32 pixels with a suitable spatial predicate, which would offer even better
angular resolution. Practical implementation will not be as straightforward, though, at
least not for the 32-bit version. Another interesting and related detail is the spatial size
of the operators. Some may find our experimental results surprisingly good, considering
how small the support of our operators is, for example in comparison to the much
larger Gabor filters that are often used in texture analysis. However, the built-in support
of our operators is inherently larger than 3x3 or 5x5, as only a limited subset of
patterns can reside adjacent to a particular pattern. Still, our operators may not be
suitable for discriminating textures where the dominant features appear at a very large
scale. This can be addressed by increasing the spatial predicate, as the operators can be
generalized to any neighborhood size. Further, operators with different spatial resolutions
can be combined for multiscale analysis, and ultimately, we would want to incorporate
scale invariance, in addition to gray scale and rotation invariance. Another thing
deserving a closer look is the use of a problem or application specific subset of rotation
invariant patterns, which may in some cases provide better performance than ‘uniform’
patterns. Patterns or pattern combinations are evaluated with some criterion, e.g.
classification accuracy on training data, and the combination providing the best accuracy
is chosen. Since combinatorial explosion may prevent an exhaustive search
through all possible subsets, suboptimal solutions such as stepwise or beam search
should be considered. We also reported that when there are classification errors, the
model of the true class very often ranks second. This suggests that classification could
be carried out in stages, by selecting features which best discriminate among the
remaining alternatives.
Note

Texture images used in this study, together with other imagery used in our published 
work, can be downloaded from http://www.ee.oulu.fi/research/imxig/texture. 
Abstract. Spectral analysis provides a powerful means of estimating
the perspective pose of texture planes. Unfortunately, one of the problems 
that restricts the utility of the method is the need to set the size of the 
spectral window. For texture planes viewed under extreme perspective 
distortion, the spectral frequency density may vary rapidly across the 
image plane. If the size of the window is mismatched to the underlying 
texture distribution, then the estimated frequency spectrum may become 
severely defocussed. This in turn limits the accuracy of perspective pose 
estimation. The aim in this paper is to describe an adaptive method 
for setting the size of the spectral window. We provide an analysis which 
shows that there is a window size that minimises the degree of defocusing. 
The minimum is located through an analysis of the spectral covariance 
matrix. We experiment with the new method on both synthetic and real 
world imagery. This demonstrates that the method provides accurate 
pose angle estimates, even when the slant angle is large. We also provide 
a comparison of the accuracy of perspective pose estimation that results 
both from our adaptive scale method and with one of fixed scale. 



1 Introduction 

Key to shape-from-texture is the problem of estimating the orientation of planar
patches from the perspective foreshortening of regular surface patterns [1,2].
Conventionally, the problem is solved by estimating the direction and magnitude
of the texture gradient [3]. Geometrically, the texture gradient determines the
tilt direction of the plane in the line-of-sight of the observer. Broadly speaking,
there are two ways in which the texture gradients can be used for shape esti- 
mation. The first of these is to perform a structural analysis of pre-segmented 
texture primitives in terms of the geometry of edges, lines or arcs [MIbIbj . The 
second approach is to cast the problem of shape-from-texture in the frequency 
domain. The main advantage of the frequency domain approach is

that it does not require an image segmentation as a pre-requisite. For this reason 
it is potentially more robust than its structural counterpart. 

The observation underpinning this paper is that there is a basic chicken 
and egg problem which limits the estimation of perspective pose from spectral 
information. Before reliable local spectra can be estimated, there needs to be 
an estimate of the local distortion of the texture so that the size of the spectral
window can be set. However, this local distortion is, after all, the ultimate goal 
of perspective pose recovery. The main problem stems from the fact that if the 
window size is incorrectly set then the local estimate of the texture spectrum 
becomes defocussed. 

This defocusing has serious implications for shape-from-texture. Most methods
for recovering perspective pose or determining surface shape rely on finding
spectral correspondences. If the local spectrograms become defocussed,
then the process of matching corresponding spectral components can be frustrated
by delocalisation or merging. This problem arises in two important situations.
The first of these is for texture planes that are subjected to severe perspective 
foreshortening. The second is for highly curved surfaces. In other words the effect 
is most marked when the depth of the plane is small compared to the focal length 
of the camera and when the slant angle of the plane is large. Distant texture 
elements appear smaller while closer ones appear bigger. In order to accurately 
quantify the information provided by this texture gradient we must be able to 
locally adapt the size of the spectral window. 

Most of the methods listed above opt to use a spectral window of fixed size. In
other words, the accuracy of their perspective pose estimates is likely to be limited
by defocusing. There are two exceptions. Gårding and Lindeberg address
the scale problem by employing a Gaussian scale-space decomposition locally over
the structural primitives. Stone and Isard have a method which interleaves
the adjustment of local filters for adaptive scale edge detection and the estimation
of planar orientation in an iterative feedback loop. Although this provides
a means of overcoming the chicken and egg nature of the estimation problem, it
is couched in terms of structural textures and is sensitive to initialisation. The
aim in this paper, on the other hand, is to improve the accuracy of perspective
pose estimation by providing a means of adaptively and locally setting the size
of the spectral window. The work commences from a spectral domain analysis
where we show that there is a critical window size that minimises the degree of
defocusing. In order to provide a way of locally estimating this window size, we
turn to the covariance matrix for the two components of the texture spectrum.
Our search for the optimum window size is guided by the bias-variance structure
of the covariance matrix. The size of the local spectral window is varied until
the determinant of the spectral covariance matrix is minimised.



2 Motivation 

As pointed out earlier, the selection of the physical scale for local spectral de- 
scriptors is critical to accurate shape from texture. To illustrate this point. Fi- 
gure n shows a doubly sinusoidal texture viewed under perspective projection. 
The perspective distortion of the texture is controlled by the slant-angle of the 
texture-plane, which in this case is in the horizontal direction. The main feature 
to note is that the spatial frequency of the texture increases in the slant direction. 
To accurately estimate the frequency content using local spectral descriptors, it 
is important that the local sampling window is adapted to avoid distortion of the
measurements. If the size of the window is fixed, then the computed descriptors 
will not accurately reflect the local spectral content at the relevant point on the 
image plane. Moreover, if the window is too large then the texture spectra will 
be defocussed. This problem is also illustrated in Figure 1. The local spectra 
estimated at the three marked locations are ordered in columns. The top row 
shows the spectra estimated with a fixed window, while the lower row shows the 
spectra obtained if the optimally sized sampling window is used. The main fea- 
ture to note is that the spectra estimated with the fixed size window are blurred. 
The spectra estimated with the optimally sized window, on the other hand, are 
crisp. 




Fig. 1. Projected artificial texture with squares showing the window sizes employed by 
the spectral estimator together with the power spectrum response using a fixed data 
window and our adaptive window. 



3 Texture Gradient as a Time-Varying Signal 

As we mentioned above, there is a chicken and egg problem which hinders the 
estimation of perspective pose from spectral information. The slant parameter 
induces a texture gradient across the texture plane. However, in order to recover 
this gradient, and hence estimate perspective pose, we must adapt the spectral 
window to the appropriate local scale. If the window size is not chosen correctly, 
then the local estimate of the texture spectra will be defocussed. In this section 
we provide a simple model which can be used to understand how the defocusing 
varies with the choice of window size. This analysis reveals that there is a local 
window size that minimises the defocusing. 

To estimate the optimal local window size we require a means of estimating 
the degree of defocusing. Here we use the covariance matrix for the two spectral 
components as a measure of spectral dispersion. We locate the optimal window 
size by minimising the determinant of the spectral covariance matrix. 

3.1 Perspective Projection

To commence our analysis of the spectral defocusing due to perspective
foreshortening, we first review the projective geometry for the perspective
transformation of points on a plane. Specifically, we are interested in the
perspective transformation between the object-centred co-ordinates of the points
on the texture plane and the viewer-centred co-ordinates of the corresponding
points on the image plane. To be more formal, suppose that the texture plane is
a distance $h$ from the camera, which has focal length $f < 0$. Consider two
corresponding points. The point with co-ordinates $X_t = (x_t, y_t, z_t)^T$ lies
on the texture plane, while the corresponding point on the image plane has
co-ordinates $X_i = (x_i, y_i, f)^T$. We represent the orientation of the viewed
texture plane in the image plane co-ordinate system using the slant $\sigma$ and
tilt $\tau$ angles m. For a given plane, the slant is the angle between the viewer
line of sight and the normal vector of the plane. The tilt is the angle of
rotation of the normal vector to the texture plane around the line of sight
axis. Furthermore, since we regard the texture as being "painted" on the
texture plane, the texture height $z_t$ is always equal to zero. With these
ingredients the perspective transformation between the texture-plane and image
co-ordinate systems is given in matrix form by



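\[
\begin{pmatrix} x_i \\ y_i \\ z_i \end{pmatrix}
= \frac{f}{h - x_t \sin\sigma}
\left\{
\begin{pmatrix}
\cos\sigma\cos\tau & -\sin\tau & \sin\sigma\cos\tau \\
\cos\sigma\sin\tau & \cos\tau & \sin\sigma\sin\tau \\
-\sin\sigma & 0 & \cos\sigma
\end{pmatrix}
\begin{pmatrix} x_t \\ y_t \\ 0 \end{pmatrix}
+
\begin{pmatrix} 0 \\ 0 \\ h \end{pmatrix}
\right\}
\tag{1}
\]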



The first term inside the curly braces represents the rotation of the texture 
plane in slant and tilt. The second term represents the displacement of the rota- 
ted plane along the optic axis. Finally, the multiplying term outside the braces 
represents the non-linear foreshortening in the slant direction. When expressed 
in this way, $z_i$ is always equal to $f$ since the image is formed at the focal plane of
the camera. As a result we can confine our attention to the following simplified
transformation of the x and y co-ordinates

\[
\begin{pmatrix} x_i \\ y_i \end{pmatrix}
= \frac{f}{h - x_t \sin\sigma}
\begin{pmatrix}
\cos\sigma\cos\tau & -\sin\tau \\
\cos\sigma\sin\tau & \cos\tau
\end{pmatrix}
\begin{pmatrix} x_t \\ y_t \end{pmatrix}
\tag{2}
\]

This transformation can be represented using the shorthand $(x_i, y_i)^T = T_p (x_t, y_t)^T$,
where $T_p$ is the 2x2 transformation matrix. As written above, the
transformation $T_p$ can be considered as a composition of two transformations. The
first of these is a non-uniform scaling proportional to the displacement in the 
slant direction. The second transformation is a counterclockwise rotation by an 
amount equal to the tilt angle. 
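
The two-step reading of the transformation (slant-dependent scaling followed by a tilt rotation) can be sketched numerically. The following is a minimal illustration, not the authors' code: the function name `project_texture_points` is hypothetical, and the matrix entries assume the standard slant/tilt rotation, with the foreshortening factor f / (h - x_t sin(slant)) applied to the upper-left 2x2 block.

```python
import numpy as np

def project_texture_points(xt, yt, slant, tilt, h, f):
    """Map texture-plane points (xt, yt) to image-plane points (xi, yi).

    Angles are in radians; h is the plane distance and f the focal length.
    Assumes the standard slant/tilt rotation (an assumption of this sketch).
    """
    xt = np.asarray(xt, dtype=float)
    yt = np.asarray(yt, dtype=float)
    # Non-linear foreshortening factor that multiplies the rotation.
    scale = f / (h - xt * np.sin(slant))
    # Upper-left 2x2 block of the slant/tilt rotation matrix.
    xi = scale * (np.cos(slant) * np.cos(tilt) * xt - np.sin(tilt) * yt)
    yi = scale * (np.cos(slant) * np.sin(tilt) * xt + np.cos(tilt) * yt)
    return xi, yi

# Example call with illustrative (made-up) parameter values.
xi, yi = project_texture_points([0.0, 1.0, 2.0], [0.0, 0.0, 0.0],
                                slant=np.radians(30.0), tilt=0.0, h=10.0, f=-1.0)
```

With zero tilt and points on the x-axis, the y co-ordinate stays at zero and the x co-ordinate depends on the slant alone, which is the 1-D situation analysed next.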



3.2 Unidimensional Texture Gradient 

We now turn our attention to the effect of perspective foreshortening on the
frequency contents of the texture plane. To simplify our analysis we confine our
attention to the line of maximum texture gradient. This line points in the tilt
direction. Figure 2 illustrates the idea.

Fig. 2. Simplification of the geometric model for perspective projection

Without loss of generality, we can assume that the line in question is aligned 
with the x-axis in both the image and texture plane co-ordinate systems. From 
Equation (2) the relationship between the two co-ordinate systems is given by

\[ x_i = \frac{f x_t \cos\sigma}{h - x_t \sin\sigma} \tag{3} \]

while the inverse transformation is

\[ x_t = \frac{h x_i}{x_i \sin\sigma + f \cos\sigma} \tag{4} \]

As a result the perspective distortion of the 1D texture signal depends only on
the slant angle.

Now consider a simple image model in which the texture is represented by 
the following sinusoidal variation in intensity on the texture-plane 



\[ I(x_t) = \cos(2\pi\omega_0 x_t) \tag{5} \]

where $\omega_0$ is the frequency of the texture pattern. Using Equation (4), the
projected version of the texture pattern in the image plane is given by

\[ I'(x_i) = \cos\left( 2\pi\omega_0 \frac{h x_i}{x_i \sin\sigma + f \cos\sigma} \right) \tag{6} \]

The local frequency content is a function of the parameters of the 1D perspective
projection, i.e. the height $h$, focal length $f$ and slant angle $\sigma$, together with the
position on the image plane, i.e. $x_i$. The instantaneous frequency at the position
$x_i$ is given by

\[ \Omega(x_i) = \frac{h \omega_0 x_i}{x_i \sin\sigma + f \cos\sigma} \tag{7} \]

The texture gradient is related to the derivative of the instantaneous frequency

\[ \Omega'(x_i) = \frac{\partial \Omega(x_i)}{\partial x_i} = \frac{f h \omega_0 \cos\sigma}{(x_i \sin\sigma + f \cos\sigma)^2} \tag{8} \]

This 1D texture-gradient model is similar to a linear FM chirp waveform. The
finite chirp waveform is a time-varying signal whose frequency increases or
decreases as a linear function of time. In our case, however, the perspective
projection imposes a non-linear rate of variation for the frequency with time.
Nevertheless, when $\tan\sigma \ll f / x_i$ the frequency modulation follows the
approximate linear form

\[ \Omega'(x_i) \approx \frac{h \omega_0}{f \cos\sigma} \left[ 1 - \frac{2 x_i \tan\sigma}{f} \right] \tag{9} \]

Figure 3 illustrates the plot of the 1-D texture gradient represented by
Equation (6). The rate of change of the frequency with time is also shown in
the figure.
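
The behaviour of the 1-D model can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the formulas assume the inverse mapping x_t = h x_i / (x_i sin(slant) + f cos(slant)), its derivative scaled by omega0, and the first-order expansion of that derivative for small x_i tan(slant) / f.

```python
import numpy as np

def texture_coordinate(xi, slant, h, f):
    """Back-project an image position xi onto the texture line (inverse 1-D map)."""
    return h * xi / (xi * np.sin(slant) + f * np.cos(slant))

def local_frequency(xi, omega0, slant, h, f):
    """Derivative of omega0 * x_t(x_i) with respect to x_i."""
    return f * h * omega0 * np.cos(slant) / (xi * np.sin(slant) + f * np.cos(slant)) ** 2

def local_frequency_linear(xi, omega0, slant, h, f):
    """First-order approximation, valid when |xi * tan(slant)| is much smaller than |f|."""
    return h * omega0 / (f * np.cos(slant)) * (1.0 - 2.0 * xi * np.tan(slant) / f)

# Illustrative (made-up) parameters: compare the exact and linearised forms.
positions = np.linspace(-0.02, 0.02, 5)
exact = local_frequency(positions, omega0=1.0, slant=0.3, h=10.0, f=1.0)
approx = local_frequency_linear(positions, omega0=1.0, slant=0.3, h=10.0, f=1.0)
```

Near the image centre the two curves nearly coincide, and the gap grows as x_i tan(slant) becomes comparable to the focal length, which is where the chirp stops being approximately linear.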





Fig. 3. (a) Texture gradient as a time-varying signal (non-linear chirp). (b) Time-
varying instantaneous frequency.



The chirp is a non-stationary signal. Its spectral density covers a broad band 
of frequencies which vary in time. In other words, it has an evolutionary broad 
band spectrum. In order to analyse signals of this form, one approach is to 
minimise the observation period while maintaining a reasonable spectral resolu- 
tion HS|. The Fourier spectrum of a broad band signal is continuous and covers a 
wide range of frequencies. It is worth noting at this point an important
limitation of the Fourier transform. When the signal is periodic and sufficiently
stationary, the Fourier coefficients converge rapidly. Unfortunately, this is not 
the case for non-stationary signals. As a result the Fourier transform itself is not 
satisfactory for analysing signals whose spectra vary significantly with time. 

3.3 Windowing

When analysing signals of a non-stationary nature, it is often important to
understand the correlation between the time domain and frequency domain
representations of the signal. The Fourier transform itself provides information
about the frequency domain. However, the time localisation of the frequency 
information is essentially lost in the process of computing the Fourier transform. 
Another representational difficulty that can arise for non-stationary signals is 
related to uniqueness. Two non-stationary signals that have completely different 
periodicity can produce very similar spectra. As a result, Fourier analysis alone 
is insufficient to represent time-varying signals. 

An alternative is to perform a time-spectral analysis. A non-stationary sig- 
nal is divided into a sequence of time slices within which the signal is quasi- 
stationary. This is the idea exemplified by the short-time Fourier transform of 
Gabor j1 b). This method uses a sliding-window Fourier transform. The
relationship with the conventional Fourier transform is captured by the following
definition 



\[ F(\omega, t_s) = \mathcal{F}\{ g(t - t_s) \times f(t) \} \tag{10} \]

where $g(t - t_s)$ is a short-time window of fixed width, shifted along the
time axis by an amount $t_s$.

This window operation allows us to locally analyse the spectral energy con- 
tent of a signal over a given time interval. If the window width is sufficiently 
small, then its spectral content can be approximated by the instantaneous
frequency at the center point of the window. The instantaneous frequency at a
specific time is given by the derivative of the angular argument of the signal at
that time. Equation (6) represents a non-linear chirp waveform whose
instantaneous frequency varies approximately linearly with time. The instantaneous
frequency is given by Equation (7). By the Fourier duality theorem, the windowing
operation in the time domain is equivalent to a frequency domain convolution of
the Fourier transform of the signal with the Fourier transform of the windowing
function. This convolution introduces a blurring or defocusing in the Fourier
domain. The amount of defocusing is inversely proportional to the width of the
windowing function. It originates from the main lobe broadening introduced by
the windowing function.
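
The inverse relationship between window width and main-lobe broadening can be observed directly. The sketch below is a hypothetical illustration (the function name and the test frequency are not from the paper): it truncates a stationary sinusoid with a rectangular window and measures the half-power width of the resulting spectral peak.

```python
import numpy as np

def half_power_width(window_len, freq=0.2, n_fft=1 << 16):
    """Half-power (-3 dB) width, in cycles/sample, of the spectral peak of a
    sinusoid observed through a rectangular window of window_len samples."""
    t = np.arange(window_len)
    segment = np.cos(2.0 * np.pi * freq * t)            # rectangular window = plain truncation
    power = np.abs(np.fft.rfft(segment, n=n_fft)) ** 2  # zero-padding only interpolates the spectrum
    peak = int(np.argmax(power))
    above = power >= 0.5 * power[peak]
    lo = peak
    while lo > 0 and above[lo - 1]:
        lo -= 1
    hi = peak
    while hi < above.size - 1 and above[hi + 1]:
        hi += 1
    return (hi - lo + 1) / n_fft                        # rfft bins are 1 / n_fft apart

# Doubling the window length should roughly halve the measured width.
widths = {n: half_power_width(n) for n in (32, 64, 128)}
```

Because the signal here is stationary, all of the measured spread comes from the window itself. For a chirp, the frequency swept inside the window adds to this spread.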

3.4 Defocusing 

We will now provide an analysis of the blurring of the frequency domain spectrum 
which results from the windowing process. Consider the effect of windowing the 
signal given in Equation (8) with a rectangular window function of width $T$. The
spectral bandwidth is the range of spectral content enclosed within a specific time
interval. Using Equation (8), the spectral bandwidth of the 1-D texture gradient
for a given window of width $T$ is given by

\[ B = 2 \times \Omega'(T) = \frac{2 f h \omega_0 \cos\sigma}{(T \sin\sigma + f \cos\sigma)^2} \tag{11} \]

Fig. 4. Illustration of the time-bandwidth tradeoff for the limited time broad spectrum

In other words, the signal can be compressed into a time interval of length
$1/B$. However, we must add the excess bandwidth introduced by the windowing
operation. Here there are two opposing effects. The spectral bandwidth of
the signal decreases as the window size decreases. The excess bandwidth, on the
other hand, increases as the window size decreases. This is basically the tradeoff
between time-localisation and frequency-resolution. Figure 4 illustrates the
trade-off. According to the figure, the bandwidth $B$ is extended by an excess due
to the window mainlobe contribution. The aim here is to minimise the defocusing due
to the band excess while reducing the size of the window. Figure 5 shows the
individual contributions to the time-bandwidth from: (a) a rectangular window
and (b) a 1-D texture gradient of size $T$. This plot indicates the theoretical
minimum of the defocusing trade-off illustrated by Figure 4. In Figure 6 we show
a plot of the signal defocusing as a function of the window size $T$. The
minimum corresponds to the optimum window size for the given bandwidth of the
non-stationary signal.
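
The existence of a minimum can be illustrated with a deliberately simplified model. This is an assumption made for illustration and is not the expression used in the paper: the signal is taken to be a linear chirp, so the frequency swept inside a window of width T is `chirp_rate * T`, and the excess bandwidth of the rectangular window is taken as its null-to-null main-lobe width 2 / T. The names `defocus` and `optimal_window` are hypothetical.

```python
import numpy as np

def defocus(T, chirp_rate):
    """Total spectral spread for a window of width T (illustrative model)."""
    signal_bandwidth = chirp_rate * T   # frequency swept by the chirp inside the window
    excess_bandwidth = 2.0 / T          # null-to-null main-lobe width of a rectangular window
    return signal_bandwidth + excess_bandwidth

def optimal_window(chirp_rate, widths):
    """Return the candidate width that minimises the modelled defocus."""
    widths = np.asarray(widths, dtype=float)
    return widths[np.argmin(defocus(widths, chirp_rate))]

candidates = np.linspace(0.5, 8.0, 751)
best = optimal_window(0.5, candidates)
```

Under this model the minimum lies at T = sqrt(2 / chirp_rate), obtained by setting the derivative of `chirp_rate * T + 2 / T` to zero. A faster chirp therefore calls for a narrower window, which matches the qualitative argument above.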

This analysis suggests how we might reduce the defocusing of the Fourier 
spectrum by adapting the width of the local data window. The optimum size 
is reached when the blurring of the local spectrum is minimised. This optimum 
window width itself leads to an optimum estimation of the instantaneous fre- 
quency of the signal at a specific point. The idea underpinning this paper is to 
exploit this property to develop an unsupervised adaptive version of the short 
time Fourier transform. We aim to exploit the method to avoid the defocusing 
of the spectral representation and hence provide accurate estimation of the local 
spectral frequency. 

3.5 Minimising the Defocusing of the Local 2-D Spectral 
Distribution 

At this point we are ready to extend our unidimensional analysis to the
two-dimensional analysis of textured images viewed under perspective geometry. To
achieve this goal we analyse the texture gradient using local estimates of the
power spectrum based on a quilt of patches on the image plane. The 2-D spectral
defocusing is modelled using the covariance matrix to describe the dispersion of
the spectral distribution. In this way we can determine the degree of blur in the
spectral energy.

Fig. 5. Time-bandwidth plot for a rectangular window of size T together with the
time-bandwidth plot for the 1-D texture gradient window of size T

We commence by defining our local spectral estimator. In order to obtain
a smooth spectral response we use the Blackman-Tukey (BT) power spectrum
estimator. This is defined to be the frequency response of the windowed
autocorrelation function. We employ a triangular smoothing window $w(X)$ due to
its well documented spectral stability mi. The spectral estimator is then

\[ P(U_i)_{BT} = \mathcal{F}\{ \hat{r}_{xx}(X_i) \times w(X_i) \} \tag{12} \]

where $\hat{r}_{xx}$ is the estimated autocorrelation function of the image patch. To find
the optimally sized spectral window, we require a measure of spectral dispersion.
Here we use the covariance matrix for the two spectral components. Formally,
the matrix is defined as follows



The eigenvalues $\lambda_1$ and $\lambda_2$ of the spectral covariance matrix are the maximum and
minimum values of spectral variance in an orthogonal co-ordinate system. The
co-ordinate system is aligned in the direction of maximum spectral variance. If
we regard the two eigenvalues as representing the two radii of a spectral ellipse,
then the spectral area is proportional to their product $\lambda_1 \lambda_2$, i.e. to the
determinant of the spectral covariance matrix.

Fig. 6. Variation of the spectral defocusing in terms of the window width T.




We locate the optimal spatial domain sampling window by varying its size 
until the spectral area, i.e. the determinant of the spectral covariance matrix, is 
minimised. This in turn ensures that the dispersion, or defocusing, of the spectral 
moments is minimised. 
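
The search over window sizes can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the covariance is assumed to be the power-weighted second central moment of the frequency co-ordinates, the spectrum is restricted to the half-plane u >= 0 so that the symmetric copy of each peak does not dominate the spread, and the autocorrelation is estimated with the FFT.

```python
import numpy as np

def bt_power_spectrum(patch):
    """Blackman-Tukey style estimate: FFT of the lag-windowed autocorrelation."""
    patch = np.asarray(patch, dtype=float)
    patch = patch - patch.mean()
    n0, n1 = patch.shape
    # Autocorrelation via the Wiener-Khinchin relation (zero-padded to avoid wrap-around).
    raw = np.abs(np.fft.fft2(patch, s=(2 * n0, 2 * n1))) ** 2
    acf = np.real(np.fft.ifft2(raw))
    # Triangular lag windows in FFT lag ordering: 1 at zero lag, 0 at the largest lag.
    w0 = 1.0 - 2.0 * np.abs(np.fft.fftfreq(2 * n0))
    w1 = 1.0 - 2.0 * np.abs(np.fft.fftfreq(2 * n1))
    power = np.abs(np.fft.fft2(acf * np.outer(w0, w1)))
    return power, np.fft.fftfreq(2 * n0), np.fft.fftfreq(2 * n1)

def spectral_covariance(power, u, v):
    """Power-weighted 2x2 covariance of the frequency co-ordinates (u, v)."""
    uu, vv = np.meshgrid(u, v, indexing="ij")
    weights = power / power.sum()
    du = uu - np.sum(weights * uu)
    dv = vv - np.sum(weights * vv)
    cuv = np.sum(weights * du * dv)
    return np.array([[np.sum(weights * du * du), cuv],
                     [cuv, np.sum(weights * dv * dv)]])

def select_window_size(image, row, col, sizes):
    """Return the candidate size whose local spectrum has the smallest covariance
    determinant. The caller must make sure every patch fits inside the image."""
    best_size, best_det = None, np.inf
    for size in sizes:
        half = size // 2
        patch = image[row - half:row + half, col - half:col + half]
        power, u, v = bt_power_spectrum(patch)
        keep = u >= 0                      # half-plane restriction (assumption of this sketch)
        det = np.linalg.det(spectral_covariance(power[keep, :], u[keep], v))
        if det < best_det:
            best_size, best_det = size, det
    return best_size, best_det
```

Frequencies are expressed in cycles per sample, so determinants computed from patches of different sizes are directly comparable. An exhaustive scan over a short list of candidate sizes keeps the sketch simple; any one-dimensional minimiser could replace it.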

4 Experiments 

In this section we provide some experiments with the new scale adaptation
method. We commence in Figure 7 by showing a plot of the measure of spectral
dispersion, i.e. $|\Sigma_U|$, as a function of the size of the spatial window. The curve
corresponds to the central patch marked in Figure 1. Notice that the curve has
a deep minimum. The curve also has the same gross structure as the defocusing
curve shown in Figure 6.

To show that the selected window size returns accurate estimates of local 
frequency, in Figure 8 we apply the adaptive window to the 2-D FM chirp
$f(x, y) = \cos(2\pi\omega_0 x^2)$. The figure shows the estimated instantaneous frequency
as a function of the horizontal position x. The figure shows the curves of the 
theoretical value of the instantaneous frequency together with the value esti- 
mated with our adaptive choice of window size. Also shown on the plot is the 
estimated value using a fixed scale method. The main feature to note is that the 
result obtained with the adaptive window agrees well with ground-truth. The 
fixed scale method consistently overestimates the frequency. 

To proceed, we illustrate how the adaptive window responds to increasing
perspectivity. In each panel of Figure 9 the left hand image shows an artificial
planar texture surface when oriented at various slant angles to the camera
viewing direction. In each case there is a clear variation in the density of the
texture primitives over the images. This variation occurs mainly in the tilt
direction, which is also the direction of the texture gradient. Superimposed on the
images are the areas of the local windows which minimise the determinant of the
spectral covariance matrix. One feature to note is that each window contains
approximately the same number of texture primitives. This is an important feature
since the energy relationships between the image patches can only be preserved
under scale consistency.

Fig. 7. Experimental plot of $|\Sigma_U|$ over increasing scale. The minimum of the
distribution indicates a suitable choice for the size of the spectral analysing window.

Turning our attention to the estimated spectra, Figure 9 illustrates the
defocusing produced by a poor choice of window length. In the panels of Figure 9
the remaining columns show the spectra estimated at the marked positions on
the planes. In each case the top row shows the results obtained using a fixed
window, while the lower row shows the adaptive window spectra.

We now furnish some results produced when our adaptive scale algorithm is
applied to real-world textures m. Figure 10 shows three textures taken from
the Brodatz album m. The images shown are projected at 45 degrees of slant.
The marked patches again correspond to the optimal local spectral windows.

The final piece of experimentation aims to illustrate the utility of the new
method in improving the accuracy of perspective pose estimation. The results
are reported for the textures shown in Figures 9 and 10. Table 1 compares the
estimated perspective parameters with ground truth for the artificial textures
shown in Figure 9. The algorithm used in these experiments is described in EDI.

Fig. 8. Instantaneous frequency of a 2-D FM chirp $f(x, y) = \cos(2\pi\omega_0 x^2)$ calculated
along the horizontal direction.



The main feature to note from this table is that if we use the adaptive spectral 
window, then we can recover accurate estimates of slant and tilt even when the 
planes are highly inclined. Table 2 presents the estimated perspective parameters 
for the three Brodatz textures shown in Figure 10. These results show a clear
improvement in the pose parameter estimates, even for non-regular textures.

5 Conclusions 

The main contribution in this paper has been to provide a new technique for ad- 
aptively setting the size of the local spectral window for estimating perspective 
pose from frequency information. The aim is to minimise the degree of defocu- 
sing that results if a fixed window of inappropriate size is used. The criterion 
underpinning our method is the determinant of the spectral covariance matrix. 
Based on an experimental study, we show that the new method leads to improved 
estimates of perspective pose. 

Our future plans revolve around using the new method to estimate shape 
from the texture distribution of curved objects. Suffice to say that studies aimed 
at addressing this topic are in hand and will be reported in due course. 

Fig. 9. Estimated scales for rectangular windows for several views of a texture plane.
The corresponding spectral response is shown on the right of each figure. First row:
fixed scale estimator; second row: adaptive estimator.




Fig. 10. Estimated scales for rectangular windows for Brodatz textures at 45 degrees
slant. (a) D14 - aluminum wire; (b) D21 - French canvas; (c) D36 - Lizard skin.

TABLE 1 - Slant and tilt values - Fixed and Adaptive Methods (Artificial Textures)
(all angles in degrees)

       Actual values    Fixed Scale Method             Adaptive Scale Method
                        estimated      abs. error      estimated      abs. error
       σ      τ         σ'     τ'      σ'     τ'       σ'     τ'      σ'     τ'
(a)    10     0         12.0   0.0     2.0    0.0      9.8    0.0     0.2    0.0
(b)    30     0         27.5   0.0     2.5    0.0      29.5   0.0     0.5    0.0
(c)    60     0         62.1   1.0     2.1    1.0      59.4   0.0     0.6    0.0
(d)    80     0         73.4   2.7     6.6    2.7      80.4   0.2     0.4    0.2

TABLE 2 - Slant and tilt values - Fixed and Adaptive Methods (Brodatz Textures)

Abstract. This paper describes a novel view-based learning algorithm
for 3D object recognition from 2D images using a network of linear units.
The SNoW learning architecture is a sparse network of linear functions
over a pre-defined or incrementally learned feature space and is specifically
tailored for learning in the presence of a very large number of features.
We use pixel-based and edge-based representations in large scale
object recognition experiments in which the performance of SNoW is
compared with that of Support Vector Machines (SVMs) and nearest
neighbor using the 100 objects in the Columbia Image Object Database
(COIL-100). Experimental results show that the SNoW-based method
outperforms the SVM-based system in terms of recognition rate and the
computational cost involved in learning. Most importantly, SNoW's
performance degrades more gracefully when the training data contains fewer
views. The empirical results also provide insight into practical and
theoretical considerations on view-based methods for 3D object recognition.



1 Introduction 

View-based object recognition has attracted much attention in recent years. 
In contrast to methods that rely on pre-defined geometric (shape) models for
recognition, view-based methods learn a model of the object’s appearance in 
a two-dimensional image under different poses and illumination conditions. At 
evaluation time, given a two-dimensional image, the learned model is used to 
determine if the target object is present in the image or not. 

Among the view-based object recognition methods, parametric eigenspace 
UK (El and support vector machine approaches PI have demonstrated ex- 
cellent recognition results on the COIL-20 and COIL-100 databases. Although 
these systems can recognize objects in almost real-time, the computational cost 
involved in learning is extremely high. Consequently these methods are typically 
demonstrated using only small, often different and unspecified subsets of objects
from the whole database, which makes fair comparison of results difficult. More
significantly, the training sets used in previous experimental studies consist of
images taken in nearby poses (usually 10° apart). This particular experimental 
setup, as we will show, makes the learning problem less challenging. In order to 
study algorithms in a somewhat more realistic situation, it is of great interest 
to compare the performance of these methods when only a limited number of 
views of the objects are presented during training. 



D. Vernon (Ed.): ECCV 2000, LNCS 1842, pp. 4.39- B^ 2000. 
(c) Springer- Verlag Berlin Heidelberg 2000 

In this work, we propose a method that applies the SNoW (Sparse Network of
Winnows) learning algorithm m PI (available at http://L2R.cs.uiuc.edu/~cogcomp.html)
to 3D object recognition and compare its performance with
SVM and nearest neighbor methods. SNoW is a sparse network of linear
functions that utilizes the Winnow update rule 0. It is specifically tailored to
learning in domains in which the potential number of features taking part in 
decisions is very large (and may be unknown a priori), although only a small 
number of them is typically relevant to a decision. Some of the characteristics 
of this learning architecture are its sparsely connected units, the allocation of 
features and links in a data driven way, the decision mechanism and the uti- 
lization of a feature-efficient update rule. An additional property of the SNoW 
architecture that makes it attractive for learning in vision is that it learns a 
representation for each object rather than a discrimination rule for each pair, as 
do other methods. This allows for more appealing evaluation schemes and for 
the incorporation of external information sources into the process of learning a 
representation and recognizing an object. SNoW has been used successfully on 
a variety of large scale learning tasks in natural language processing uni m and 
recently, on face detection izg. 

This paper is organized as follows. Previous work on view-based methods
that learn to recognize 3D objects is described in Section 2, which also provides
details on the use of SVMs for this problem. The SNoW learning architecture
and its use for object recognition are presented in Section 3. Section 4 presents
the experimental setup and an experimental comparison of the proposed
method with SVM and nearest neighbor. The experimental comparison focuses on
varying the number of view points and, for SNoW, also on different image
representations. We conclude with some comments on these learning methods and
future work in Section 5.

2 View-Based Methods 

The appearance of an object is the combined effect of its shape, reflectance
properties, pose, and the illumination in the scene. While shape and reflectance 
are intrinsic properties that do not change for a rigid object, pose and illumina- 
tion vary from one scene to another. View-based recognition methods attempt 
to use data observed under different poses and illumination conditions to learn 
a compact model of the object’s appearance; this, in turn, is used to resolve the 
recognition problem from view points that were not observed previously. 

A number of view-based schemes have been developed to recognize 3D ob- 
jects. Poggio and Edelman HSl show that 3D objects can be recognized from the 
raw intensity values in 2D images (we call this representation here a pixel-based 
representation) using a network of generalized radial basis functions. They argue 
and demonstrate that full 3D structure of an object can be estimated if enough 
2D views of the object are provided. Turk and Pentland HB) demonstrate that 
human faces can be represented and recognized by “eigenfaces.” Representing a 
face image as a vector of pixel values, the eigenfaces are the eigenvectors
associated with the largest eigenvalues, which are computed from a covariance matrix
of the sample vectors. An attractive feature of this method is that the eigenfaces
can be learned from the sample images in pixel representation without any fea- 
ture selection. The eigenspace approach has since been used in different vision 
tasks from face recognition to object tracking. Murase and Nayar m ini deve- 
lop a parametric eigenspace method to recognize 3D objects directly from their 
appearance. For each object of interest, a set of images in which the object ap- 
pears in different poses is obtained as training examples. Next, the eigenvectors 
are computed from the covariance matrix of the training set. The set of images 
is projected to a low dimensional subspace spanned by a subset of eigenvectors, 
in which the object is represented as a manifold. A compact parametric model 
is constructed by interpolating the points in the subspace. In recognition, the 
image of a test object is projected to the subspace and the object is recognized 
based on the manifold it lies on. Using a subset of the Columbia Object Image 
Library (COIL-100), they show that 3D objects can be recognized accurately
from their appearances in real-time. 
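
The eigenspace idea can be sketched in a few lines. This is a hedged illustration, not the parametric eigenspace method itself: the function names are hypothetical, the eigenvectors are obtained from a singular value decomposition of the centred data rather than from an explicit covariance matrix, and the interpolated manifold is replaced by a plain nearest-neighbour search over the projected training views.

```python
import numpy as np

def fit_eigenspace(images, n_components):
    """images: (n_samples, n_pixels) array. Returns the mean and the leading eigenvectors."""
    mean = images.mean(axis=0)
    centered = images - mean
    # Rows of vt are eigenvectors of the sample covariance matrix,
    # ordered by decreasing eigenvalue.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def project(images, mean, basis):
    """Co-ordinates of each image in the low dimensional subspace."""
    return (images - mean) @ basis.T

def recognise(test_image, mean, basis, train_coords, train_labels):
    """Label of the nearest training view in the eigenspace."""
    coords = project(test_image[None, :], mean, basis)
    distances = np.linalg.norm(train_coords - coords, axis=1)
    return train_labels[int(np.argmin(distances))]
```

Using the SVD avoids forming the n_pixels by n_pixels covariance matrix, which matters when each image has many more pixels than there are training views.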

In contrast to these algebraic methods, general purpose learning methods
such as support vector machines (SVMs) have also been used for this problem.
Schölkopf |I3 was the first to apply SVMs to recognize 3D objects from 2D
images and has demonstrated the potential of this approach in visual learning.
Pontil and Verri H3I also used SVMs for 3D object recognition and experimented
with a subset of the COIL-100 dataset. Their training set consisted of 36 images
(one for every 10°) for each of the 32 objects they chose, and the test sets
consisted of the remaining 36 images for each object. For 20 random selections of
32 objects from the COIL-100, the system achieves a perfect recognition rate (but
see comments on that in Sec. 4). More recently, a subset of the COIL-100 has
also been used by Roobaert and Van Hulle m to compare the performance of
SVMs with different pixel-based input representations.

Given the success of this approach, which we compare with the approach
presented here, we present the SVM method below in some more detail.

2.1 Support Vector Machines 

The Support Vector Machine (SVM) |2Dj P] is a general purpose learning me- 
thod for pattern recognition and regression problems that is based on the theory 
of structural risk minimization. According to the structural risk minimization 
inductive principle, a function that describes the training data well and belongs 
to a set of functions with low VC dimension¹ will generalize well (that is, will
guarantee a small expected recognition error for the unseen data points) regardless
of the dimensionality of the input space 1213!. Based on this principle, the SVM is
a systematic approach to find a linear function (a hyperplane) that belongs to a
set of functions of this form with the lowest VC dimension. The reason for using
a linear function is that for a set of linearly separable points, it is possible to 
explicitly quantify the VC dimension in terms of the minimal distance between 
positive and negative points. SVMs provide non-linear function approximations
by mapping the input vectors into a high dimensional feature space where a
linear hyperplane that separates the data exists. It can also be extended to cases
where the best hyperplane in the resulting high dimension space does not quite
separate all the data points.

¹ The VC dimension of a class of functions is a combinatorial parameter that measures
the richness of the function class. See m n for details.

Given a set of samples $(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)$ where $x_i \in \mathbb{R}^N$ is the
input vector and $y_i \in \{-1, 1\}$ its label, an SVM aims to find an optimal hyperplane
that leaves the largest possible fraction of data points of the same class on
the same side while maximizing the distance of either class from the hyperplane
(margin distance). Vapnik [20] shows that maximizing the margin distance is
equivalent to minimizing the VC dimension and therefore contributes to better
generalization. The problem of finding the optimal hyperplane is thus posed as
a constrained optimization problem and solved using quadratic programming
techniques. The optimal hyperplane, which determines the class label of a data
point $x \in \mathbb{R}^N$, is of the form

$$f(x) = \mathrm{sgn}\Big( \sum_{i=1}^{l} y_i \alpha_i \, k(x, x_i) + b \Big)$$

where $k(\cdot, \cdot)$ is a kernel function and sgn is the function that outputs +1 on
positive inputs and -1 otherwise. Constructing an optimal hyperplane is equivalent
to determining the nonzero $\alpha_i$'s. Sample vectors $x_i$ that correspond to a
nonzero $\alpha_i$ are called the support vectors (SVs) of the optimal hyperplane. The
hope, when using this method, is for a small number of support vectors, thereby 
producing a compact classifier. 
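
As an illustration of the decision rule above, the following minimal sketch evaluates the sign of the kernel expansion over the support vectors. The function names, the toy support vectors and the coefficient values are hypothetical and chosen only for illustration; finding the coefficients by quadratic programming is not shown.

```python
def dot(x, z):
    """Plain dot product; this is the kernel of a linear SVM."""
    return sum(a * b for a, b in zip(x, z))


def svm_decide(x, support_vectors, labels, alphas, b, kernel=dot):
    """Return +1 or -1 from the kernel expansion over the support vectors.

    support_vectors, labels (each +1 or -1) and alphas are parallel lists;
    only samples with a nonzero alpha need to be stored.
    """
    score = sum(alpha * y * kernel(x, sv)
                for sv, y, alpha in zip(support_vectors, labels, alphas))
    # sgn outputs +1 on positive inputs and -1 otherwise
    return 1 if score + b > 0 else -1


# Hypothetical toy classifier with two support vectors
svs = [[1.0, 0.0], [-1.0, 0.0]]
ys = [1, -1]
alphas = [0.5, 0.5]
print(svm_decide([2.0, 0.0], svs, ys, alphas, b=0.0))
print(svm_decide([-3.0, 1.0], svs, ys, alphas, b=0.0))
```

The cost of a decision grows with the number of support vectors, which is why a small number of them gives a compact classifier.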

The use of kernel functions makes it possible, by Mercer's theorem, to avoid the need
to blow up the dimensionality in order to reach a state in which the sample is
linearly separable. If the kernel is of the form $k(x, x_i) = \Phi(x) \cdot \Phi(x_i)$ for some
nonlinear function $\Phi : \mathbb{R}^N \to \mathbb{R}^M$, $M \gg N$, the computation can be done in
the original, lower dimensional space rather than working in the M dimensional
space, although the hyperplane is constructed in $\mathbb{R}^M$. For a linear SVM, the
kernel function is simply the dot product of vectors in the input space. Several 
kernel functions, such as polynomial functions and radial basis functions, have 
been shown to satisfy Mercer's theorem and are used in nonlinear SVMs, allowing the
construction of a variety of learning machines, some of which coincide with clas- 
sical architectures. However, this also results in a drawback since one needs to 
find the “right” kernel function when using SVMs. It is interesting to observe, 
though, that although the use of kernel functions seems to be one of the advan- 
tages of SVMs from a theoretical point of view, most experimental studies have 
used linear SVMs which were found to perform better. One potential reason is 
that SVMs are sensitive to outliers and various kinds of noise in the data, and this
gets worse when non-linear kernels are used. 
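
The identity between a kernel and a dot product in a larger feature space can be checked on a small case. The sketch below uses a homogeneous polynomial kernel of degree 2 on 2-dimensional inputs, for which an explicit 3-dimensional map is known; this particular kernel and the name `phi` are chosen here only for illustration and are not taken from the experiments.

```python
import math


def dot(x, z):
    return sum(a * b for a, b in zip(x, z))


def poly2_kernel(x, z):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return dot(x, z) ** 2


def phi(x):
    """Explicit feature map for poly2_kernel on 2-D inputs (N = 2, M = 3)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


x, z = (1.0, 2.0), (3.0, 4.0)
# Both routes give the same value; the kernel never builds the 3-D vectors.
print(poly2_kernel(x, z), dot(phi(x), phi(z)))
```

Here M is only slightly larger than N, but the same pattern holds for higher degrees, where the explicit map grows quickly while the kernel evaluation does not.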

3 A SNoW-Based Approach to Object Recognition 

The SNoW learning architecture, the focus of this work, also finds its origin in
computational learning theory and relies on VC theory to relate its behavior
on the training data to that on unseen test data, just like SVMs. And, as
in SVMs, the learning architecture uses linear functions. One main difference
is the way the linear functions are learned from the data. Moreover, the focus
in SNoW is significantly different in several respects. First, it attempts to learn
representations for the target concepts, rather than discriminators. This allows
SNoW units to serve as input when learning other, more involved representations.
Second, the emphasis is on an architecture and algorithms that can
deal efficiently with very high dimensional spaces, both in terms of the number of
examples required to learn and in terms of the computational complexity of learning
and evaluation. Therefore, generating expressive features (i.e., blowing up
the dimensionality of the space) in order to guarantee linear separability, if needed,
is not a problem. This algorithmic aspect makes the architecture especially
advantageous when the function space is sparse (that is, the target definition
depends on a few relevant attributes relative to the overall dimensionality of the 
space) and when not all the attributes that are used to describe instances are 
known ahead of time. This latter property is important for scalability, but also 
for future considerations when, for example, additional information sources may 
become available to a recognition module only in later stages of the process. In 
this section, we first present the SNoW learning architecture and algorithm, and 
then describe how we apply the SNoW algorithm to 3D object recognition. 

3.1 The SNoW Architecture 

The SNoW (Sparse Network of Winnows²) learning architecture is a sparse network
of linear units over a common pre-defined or incrementally learned feature
space. Nodes in the input layer of the network represent simple relations over
the input instance and are used as the input features. Each linear unit is
called a target node and represents relations or concepts which are of interest 
over the input; in the current application, target nodes represent a definition of 
an object in terms of the relations (features) extracted from the 2D image input. 
An input instance is mapped into a set of features which are active in it; this 
representation is presented to the input layer of SNoW and propagates to the 
target nodes. Target nodes are linked via weighted edges to (some of) the input 
features. 

Let $A_t = \{i_1, \ldots, i_m\}$ be the set of features that are active in an example
and are linked to the target node t. Then the linear unit corresponding to t is
active iff

$$\sum_{i \in A_t} w_i^t > \theta_t,$$

where $w_i^t$ is a positive weight on the edge connecting the ith feature to the target
node t, and $\theta_t$ is the threshold for the target node t.

Each SNoW unit may include a collection of subnetworks, one for each of
the target relations but all using the same feature space. In the current case, we
may have one unit with target subnetworks for all the target objects or we may
define different units, each with two competing target objects. A given example
is treated autonomously by each target subnetwork; an example labeled t may be
treated as a positive example by the subnetwork for t and as a negative example
by the rest of the target nodes.

² To winnow: to separate chaff from grain.

The learning policy is on-line and mistake-driven; several update rules can
be used within SNoW. The most successful, and the only one used in this work,
is a variant of Littlestone's Winnow update rule, a multiplicative update rule
that we tailored to the situation in which the set of input features is not known
a priori, as in the infinite attribute model. This mechanism is implemented
via the sparse architecture of SNoW. That is, (1) input features are allocated
in a data driven way: an input node for the feature i is allocated only if the
feature i was active in some input example, and (2) a link (i.e., a non-zero weight)
exists between a target node t and a feature i if and only if i was active in an
example labeled t.

One of the important properties of the sparse architecture is that the complexity
of processing an example depends only on the number of features active
in it, $n_a$, and is independent of the total number of features, $n_t$, observed over
the lifetime of the system. This is important in domains in which the total
number of features is very large, but only a small number of them is active in
each example.

The Winnow update rule has, in addition to the threshold $\theta_t$ at the target t,
two update parameters: a promotion parameter $\alpha > 1$ and a demotion parameter
$0 < \beta < 1$. These are used to update the current representation of the
target t (the set of weights $w_i^t$) only when a mistake in prediction is made. Let
$A_t = \{i_1, \ldots, i_m\}$ be the set of active features that are linked to the target node
t. If the algorithm predicts 0 (that is, $\sum_{i \in A_t} w_i^t \le \theta_t$) and the received label
is 1, the active weights in the current example are promoted in a multiplicative
fashion:

$$\forall i \in A_t, \quad w_i^t \leftarrow \alpha \cdot w_i^t.$$

If the algorithm predicts 1 ($\sum_{i \in A_t} w_i^t > \theta_t$) and the received label is 0, the active
weights in the current example are demoted:

$$\forall i \in A_t, \quad w_i^t \leftarrow \beta \cdot w_i^t.$$

All other weights are unchanged.
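
The mistake-driven update described above can be sketched for a single target node as follows. This is an illustrative sketch rather than the authors' implementation: the dictionary representation and the default values for the promotion and demotion parameters are assumptions made here.

```python
def winnow_update(weights, active, label, theta, alpha=1.5, beta=0.5):
    """One on-line Winnow step for a single target node.

    weights: dict mapping linked feature -> positive weight (updated in place)
    active:  features active in the current example
    label:   received label, 1 or 0
    Returns the prediction made before any update.
    """
    # Only active features that are linked to the target contribute.
    linked = [i for i in active if i in weights]
    activation = sum(weights[i] for i in linked)
    prediction = 1 if activation > theta else 0

    if prediction == 0 and label == 1:      # missed a positive: promote
        for i in linked:
            weights[i] *= alpha
    elif prediction == 1 and label == 0:    # false positive: demote
        for i in linked:
            weights[i] *= beta
    # Correct predictions, and weights of inactive features, are left unchanged.
    return prediction


w = {1: 1.0, 2: 1.0, 3: 1.0}
winnow_update(w, [1, 2], label=1, theta=2.5)   # activation 2.0, promote 1 and 2
print(w)
```

Because only the active, linked features are visited, the cost of a step depends on the number of active features and not on the size of the feature space.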

The key feature of the Winnow update rule is that the number of examples
required to learn a linear function grows linearly with the number $n_r$ of relevant
features and only logarithmically with the total number of features. This
property seems crucial in domains in which the number of potential features is
vast, but a relatively small number of them is relevant. Moreover, in the sparse
model, the number of examples required before converging to a linear separator
that separates the data (provided it exists) scales with $O(n_r \log n_a)$. Winnow is
known to learn efficiently any linear threshold function and to be robust in the
presence of various kinds of noise and in cases where no linear-threshold function
can make perfect classifications, while still maintaining its abovementioned
dependence on the number of total and relevant attributes.

Once target subnetworks have been learned and the network is being evaluated,
a decision support mechanism is employed, which selects the dominant
active target node in the SNoW unit via a winner-take-all mechanism to produce
a final prediction. In other applications the decision support mechanism may also
cache the outputs and process them along with the outputs of other SNoW units
to produce a coherent output.

Figures 1, 2 and 3 provide more details on the SNoW learning architecture.



SNoW: Objects and Notation

$F = Z^+ = \{0, 1, \ldots\}$                      /* Set of potential features */
$T = \{t_1, \ldots, t_k\} \subset F$              /* Set of targets */
$F_t \subseteq F$                                 /* Set of features linked to target t */
$tNET = \{[(i, w_i^t) : i \in F_t], \theta_t\}$   /* The representation of the target t */
$activation : T \to \mathbb{R}$                   /* Activation level of a target t */
$SNoW = \{tNET : t \in T\}$                       /* The SNoW network */
$e = \{i_1, \ldots, i_m\} \subset F^m$            /* An example, represented as a list of active features */



Fig. 1. SNoW: Objects and Notation. 



SNoW: Training and Evaluation

Training Phase: SNoW-Train(SNoW, e)
    Initially: $F_t = \emptyset$, for all $t \in T$.
    For each $t \in T$
        1. UpdateArchitecture(t, e)
        2. Evaluate(t, e)
        3. UpdateWeights(t, e)

Evaluation Phase: SNoW-Evaluation(SNoW, e)
    For each $t \in T$
        Evaluate(t, e)
    MakeDecision(SNoW, e)



Fig. 2. SNoW: Training and Evaluation. Training is the learning phase in which the 
network is constructed and weights are adjusted. Evaluation is the phase in which 
the network is evaluated, given an observation. This is a conceptual distinction; in 
principle, one can run in on-line mode, in which training is done continuously, even 
when the network is used for evaluating examples. 

SNoW: Building Blocks

Procedure Evaluate(t, e)
    $activation(t) = \sum_{i \in e} w_i^t$

Procedure UpdateWeights(t, e)
    If ($activation(t) > \theta_t$) & ($t \notin e$)       /* predicted positive on negative example */
        for each $i \in e$: $w_i^t \leftarrow w_i^t \cdot \beta$
    If ($activation(t) \le \theta_t$) & ($t \in e$)        /* predicted negative on a positive example */
        for each $i \in e$: $w_i^t \leftarrow w_i^t \cdot \alpha$

Procedure UpdateArchitecture(t, e)
    If $t \in e$
        For each $i \in e \setminus F_t$, set $w_i^t = w$   /* Link feature to target; set initial weight */
    Otherwise: do nothing

Procedure MakeDecision(SNoW, e)
    Predict winner = $\arg\max_{t \in T} activation(t)$    /* Winner-take-all prediction */



Fig. 3. SNoW: Main Procedures. 
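
The procedures in Figures 2 and 3 can be put together in a few lines. The sketch below is a simplified reading of the figures, not the authors' code: the label is passed separately instead of being an element of the example, and the threshold, update parameters and initial weight are hypothetical values.

```python
class SNoW:
    """Sparse network of Winnow units, one weight dictionary per target."""

    def __init__(self, targets, theta=1.0, alpha=1.5, beta=0.5, initial_weight=1.0):
        self.theta, self.alpha, self.beta = theta, alpha, beta
        self.initial_weight = initial_weight
        self.nets = {t: {} for t in targets}    # F_t starts empty for every target

    def evaluate(self, t, e):
        """Activation of target t: sum of weights of active, linked features."""
        net = self.nets[t]
        return sum(net[i] for i in e if i in net)

    def update_architecture(self, t, e, label):
        """Link features to t only when they occur in an example labeled t."""
        if label == t:
            for i in e:
                self.nets[t].setdefault(i, self.initial_weight)

    def update_weights(self, t, e, label):
        net = self.nets[t]
        activation = self.evaluate(t, e)
        linked = [i for i in e if i in net]
        if activation > self.theta and label != t:      # false positive: demote
            for i in linked:
                net[i] *= self.beta
        elif activation <= self.theta and label == t:   # missed positive: promote
            for i in linked:
                net[i] *= self.alpha

    def train(self, e, label):
        for t in self.nets:
            self.update_architecture(t, e, label)
            self.update_weights(t, e, label)

    def predict(self, e):
        """Winner-take-all over the target activations."""
        return max(self.nets, key=lambda t: self.evaluate(t, e))


net = SNoW(["cup", "car"])
net.train([1, 2], "cup")
net.train([3, 4], "car")
print(net.predict([1, 2]), net.predict([3, 4]))
```

Each target stores weights only for features seen in its own positive examples, which is what keeps the network sparse.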



3.2 Learning 3D Objects with SNoW 

Applying SNoW to 3D object recognition requires specifying the architecture 
used and the representation chosen for the input images. As described above, 
to perform object recognition we associate a target subnetwork with each target 
object. This target learns a definition of the object in terms of the input features 
extracted from the image. We could either define a single SNoW unit which 
contains target subnetworks for all the 100 different target objects, or we may 
define different units, each with several (e.g., two) competing target objects. 
Selecting a specific architecture makes a difference both in training time, where
learning a definition for object a makes use of negative examples of other objects
that are part of the same unit, and, more importantly, in testing: rather than
two competing objects for a decision, there may be a hundred, and the chances
of a spurious mistake caused by an incidental viewpoint are clearly
much higher. On the other hand, a single unit has significant advantages in terms of space
complexity and the appeal of the evaluation mode. This point will be discussed 
later. 

An SVM is a two-class classifier which, for an n-class pattern recognition
problem, trains $n(n-1)/2$ binary classifiers. Since we compare the performance
of the proposed SNoW-based method with SVMs, in order to maintain a fair
comparison we have to perform it in the one-against-one scheme. That is, we use
SNoW units of size two. To classify a test instance, tournament-like pair-wise
competition between all the machines is performed and the winner determines
the label of the test instance. The recognition rates of the SVM and SNoW based
methods shown in Table 2 were obtained using the one-against-one scheme.
(That is, we trained $\binom{100}{2} = 4{,}950$ classifiers for each method and evaluated
98 (= 50 + 25 + 12 + 6 + 3 + 1 + 1) classifiers on each test instance.)
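
The one-against-one scheme can be sketched as follows. The function `duel` stands for a trained binary classifier deciding between two objects and is hypothetical; how rounds with an odd number of candidates are handled is not stated in the text, so giving the last candidate a bye is an assumption of this sketch.

```python
from itertools import combinations


def pairwise_tasks(labels):
    """All unordered pairs of labels; one binary classifier is trained per pair."""
    return list(combinations(labels, 2))


def tournament(candidates, duel):
    """Knock-out competition; duel(a, b) returns the winning label of the pair."""
    remaining = list(candidates)
    while len(remaining) > 1:
        winners = [duel(remaining[k], remaining[k + 1])
                   for k in range(0, len(remaining) - 1, 2)]
        if len(remaining) % 2 == 1:
            winners.append(remaining[-1])   # assumed: unpaired candidate advances
        remaining = winners
    return remaining[0]


print(len(pairwise_tasks(range(100))))      # classifiers to train for 100 objects
print(tournament(range(100), max))          # toy duel: the larger label always wins
```

Only a small fraction of the trained pairwise classifiers is consulted for any single test instance, since each round halves the set of candidates.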

Pixel-Based Representation The first feature representation used in this
work is a set of Boolean features that encode the positions and intensity values
of pixels. Let the (x, y) pixel of an image with width w and height h have intensity
value $I(x, y)$ ($0 \le I(x, y) \le 255$). This information is encoded as a feature whose
index is $256 \times (y \times w + x) + I(x, y)$. This representation ensures that different
points in the {position × intensity} space are mapped to different features.
(That is, the feature indexed $256 \times (y \times w + x) + I(x, y)$ is active if and only if the
intensity in position (x, y) is $I(x, y)$.) In our experiments images are normalized
so that w = h = 32. Note that although the number of potential features in our
representation is 262,144 (32 × 32 × 256), only 1024 of those are active (present) in
each example, and it is plausible that many features will never be active. Indeed, 
in one of the experiments, it turned out that only 13,805 of these features were 
ever active. Since the algorithm’s complexity depends on the number of active 
features in an example, rather than the total number of features, the sparseness 
also contributes to efficiency. Also notice that while this representation seems to 
be too simplistic, the performance levels reached with it are surprisingly good. 
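
The pixel-based encoding can be written directly from the index formula. In this sketch the image is assumed to be a list of rows of integer intensities, which is a representation chosen here for illustration.

```python
def pixel_features(image, levels=256):
    """Return the active feature indices of a gray-scale image.

    image[y][x] is the intensity at column x, row y, in the range [0, levels).
    Each pixel activates exactly one feature: levels * (y * w + x) + I(x, y).
    """
    w = len(image[0])
    return [levels * (y * w + x) + row[x]
            for y, row in enumerate(image)
            for x in range(w)]


tiny = [[0, 5],
        [255, 1]]
print(pixel_features(tiny))

blank = [[0] * 32 for _ in range(32)]
print(len(pixel_features(blank)))           # one active feature per pixel
```

An example is thus a list of 1024 integers for a 32 × 32 image, regardless of how many of the potential feature indices are ever used.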

Edge-Based Representation Edge information contains significant visual
cues for human perception and has the potential to provide more information
than the previous representation and guarantee robustness. Edge-based representations
can be used, for example, to obtain a hierarchical description of an
object. While perceptual grouping has been applied successfully to many vision
problems including object and face recognition, the grouping procedure is usually
somewhat arbitrary. This work can thus be viewed as a systematic method to
learn representations of objects based on conjunctions of edges.

For each image, a Canny edge detector [2] is first applied to extract edges.
Let $I(x, y)$ represent the (x, y) pixel in an image I. Let $E(x, y)$ be the Canny
edge map in which $E(x, y) = 1$ indicates the existence of an edge at $I(x, y)$. To
prune extraneous small edge fragments and reduce the computational complexity
we keep only edges with length above some threshold (e.g., 3 pixels). E is the
resulting edge map after pruning. That is, the pixel $I(x, y)$ is considered to
contain significant perceptual information to describe the object when $E(x, y) = 1$;
otherwise, $E(x, y) = 0$. For consistency we index an edge using its top left
pixel. For each pixel we maintain up to two possible edges in the resulting E
map, a vertical one and a horizontal one, denoted by

$$e = \langle (x, y), d \rangle, \quad d \in \{v, h\}.$$

Features are generated to represent conjunctions of size two of these edges. That
is, features are elements of the cross product

$$E \times E = \{(e, e') \mid e \ne e'\}.$$

This representation thus constitutes a hierarchical representation of an object
which encodes the spatial relationships of the edges in an object. Figure 4 illustrates
the benefit of this encoding in object recognition. It shows two objects with
very similar appearance for which the edge maps (where the minimum length is
3) are different. Note that some of the edges are blurred or missing because of
the aggressive downsampling (from 128 × 128 to 32 × 32 pixels). Nevertheless,
this difference grows when conjunctions of edges are used. Finally, we note that
the number of potential features when using this representation is very large, 
but very few of them are active. 
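
Generating the conjunction features from a set of detected edges can be sketched as below. Edges are represented as ((x, y), d) tuples with d equal to 'v' or 'h'; treating a conjunction as an unordered pair is an assumption made here, and edge detection and length pruning are not shown.

```python
from itertools import combinations


def edge_conjunctions(edges):
    """Return all conjunctions of two distinct edges as unordered pairs.

    edges: iterable of ((x, y), d) with d in {'v', 'h'}, indexed by the
    top left pixel of the edge.
    """
    unique = sorted(set(edges))             # canonical order, duplicates removed
    return list(combinations(unique, 2))


edges = [((0, 0), 'h'), ((0, 0), 'v'), ((3, 1), 'h')]
for feature in edge_conjunctions(edges):
    print(feature)
```

With k edges in an image this yields k(k-1)/2 active features, so the number of active features stays small whenever the pruned edge map is sparse.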





[Figure 4 panels: (a) object 65, (b) edge map, (c) horizontal edges, (d) vertical edges; (e) object 13, (f) edge map, (g) horizontal edges, (h) vertical edges]

Fig. 4. Two objects with similar appearance (in terms of shape and intensity values)
whose edge maps are very different. Note that some of the edges are blurred or
missing because of aggressive downsampling (from 128 × 128 to 32 × 32 pixels).



4 Experiments 

We use the COIL-100 dataset to test our method and compare its performance
with other view-based methods in the literature. In this section, we
describe the characteristics of the COIL-100 dataset, present some experiments 
in which the performance of several methods is compared and discuss the empi- 
rical results. 



4.1 Dataset and Experimental Setups 

We use the Columbia Object Image Library (COIL-100) database in all the
experiments below. COIL is available at http://www.cs.columbia.edu/CAVE.
The COIL-100 dataset consists of color images of 100 objects, where the images
of each object were taken at pose intervals of 5°, i.e., 72 poses per object. The
images were also normalized such that the larger of the two object dimensions
(height and width) fits the image size of 128 × 128 pixels. Figure 5 shows the
images of the 100 objects taken in frontal view, i.e., zero pose angle. The 32
highlighted objects in Figure 5 are considered more difficult to recognize in
prior work; we use all 100 objects including these in our experiments. Each color
image is converted to a gray-scale image of 32 × 32 pixels for our experiments. In this
paper, given the use of the COIL-100 dataset, it is assumed that the illumination
conditions remain constant and hence object pose is the only variable of interest.

Fig. 5. Columbia Object Image Library (COIL-100) consists of 100 objects of varying
poses (5° apart). The objects are shown in row order where the highlighted ones are
considered more difficult to recognize in prior work.

4.2 Ground Truth of the COIL-100 Dataset 

At first glance, it seems difficult to recognize the objects in the COIL data- 
set because it consists of a large number of objects with varying pose, texture, 
shape and size. Since each object has 72 images of different poses (5° apart), 
many view-based recognition methods use 36 (10° apart) of them for training 
and the remaining images for testing. However, it turns out that under these 
dense sampling conditions the recognition problem is not difficult (even when 
only grey-level images are used). Namely, in this case, instances that belong to 
the same object are very close to each other in the image space (where each 
data point represents an image of an object in a certain pose). We verified this 
by experimenting with a simple nearest neighbor classifier (using the Euclidean 
distance), resulting in an average recognition rate of 98.50% (54 errors out of 
3,600 tests). Figure 6 shows some of the objects misclassified by the nearest
neighbor method.

In principle, one may want to avoid using the nearest neighbor method since
it requires a lot of memory for storing templates and its recognition time complexity
is high. The goal here was simply to show that this simple method is
comparable to the complex SVM approaches for the case of dense sampling.
Therefore, the abovementioned recognition problem is not appropriate for
comparison among different methods. 
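
The baseline used above is simple enough to state in full. The sketch below assumes images have already been flattened into equal-length intensity vectors; the template data shown is a hypothetical toy example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance; it ranks neighbors the same way as the distance."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest_neighbor(query, templates):
    """Return the label of the stored template closest to the query.

    templates: list of (vector, label) pairs, one per training view.
    """
    _, label = min(templates, key=lambda t: squared_distance(query, t[0]))
    return label


templates = [([0, 0], "cup"), ([10, 10], "car")]
print(nearest_neighbor([1, 2], templates))
print(nearest_neighbor([9, 7], templates))
```

Every stored view is compared with the query, which is the memory and recognition-time cost mentioned in the text.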

Table 1. Recognition rates of nearest neighbor classifier

Results            30 objects randomly    32 objects highlighted    The whole 100
                   selected from COIL     in Figure 5               objects in COIL
Errors/Tests       14/1080                46/1152                   54/3600
Recognition rate   98.70%                 96.00%                    98.50%

(a) (8:80, 23:85)   (b) (31:80, 79:85)   (c) (65:270, 13:265)   (d) (69:80, 91:75)   (e) (96:260, 69:85)

Fig. 6. Mismatched objects using the nearest neighbor method. (x : a, y : b) means
that object x with view angle a is recognized as object y with view angle b. It shows
some of the 54 errors (out of 3,600 test samples) made by the nearest neighbor classifier
when there are 36 views per object in the training set.



It is interesting to see that the pairs of objects that the nearest
neighbor method confuses have similar geometric configurations and similar
poses. A close inspection shows that most of the recognition errors are made
between the three packs of chewing gum, bottles and cars. Other dense sampling
cases are easier for this method. Consequently, the set of selected objects in an
experiment has a direct effect on the recognition rate. This needs to be taken
into account when evaluating results that use only a subset of the 100 objects
(typically 20 to 30) from the COIL dataset for experiments. Table 1 shows the
recognition rates of nearest neighbor classifiers in several experiments in which
36 poses of each object are used for templates and the remaining 36 poses are
used for tests.

Given this baseline experiment we have decided to perform our experimental 
comparisons in cases in which the number of views of objects available in training 
is limited. 

4.3 Empirical Results Using Pixel-Based Representation 

Table 2 shows the recognition rates of the SNoW-based method, the SVM-based 
method (using linear dot product for the kernel function), and the nearest neigh- 
bor classifier using the COIL-100 dataset. The important parameter here is that 
we vary the number of views of an object (n) during training and use the rest 
of the views (72 — n) of an object for testing. 
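
The test-set sizes reported in Table 2 follow from this protocol: with n training views per object, each of the 100 objects contributes 72 - n test views. The sketch below assumes the training views are evenly spaced in pose, which the text states only for the 36-view case (10° apart).

```python
def split_views(n_train, n_views=72):
    """Split view indices 0..n_views-1 into evenly spaced training views and the rest.

    View index v corresponds to a pose angle of 5 * v degrees.
    """
    step = n_views // n_train
    train = list(range(0, n_views, step))[:n_train]
    held_out = set(train)
    test = [v for v in range(n_views) if v not in held_out]
    return train, test


for n in (36, 18, 8, 4):
    train, test = split_views(n)
    print(n, "training views ->", 100 * len(test), "tests over 100 objects")
```

For n = 36, 18, 8 and 4 this gives 3600, 5400, 6400 and 6800 tests, matching the column headings of the table.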



Table 2. Experimental results of three classifiers using the 100 objects in the COIL-100
dataset

                    # of views/object
Methods             36            18            8             4
                    3600 tests    5400 tests    6400 tests    6800 tests
SNoW                95.81%        92.31%        85.13%        81.46%
Linear SVM          96.03%        91.30%        84.80%        78.50%
Nearest Neighbor    98.50%        87.54%        79.52%        74.63%

The experimental results show that the SNoW-based method performs as well
as the SVM-based method when many views of the objects are present during
training and outperforms the SVM-based method when the number of views is
limited. Although it is not surprising to see that the recognition rate decreases
as the number of views available during training decreases, it is worth noticing
that both SNoW and SVM are capable of recognizing 3D objects in the COIL-100
dataset with satisfactory performance if enough views (e.g., > 18) are provided.
Also, they seem to be fairly robust even if only a limited number of views
(e.g., 8 and 4) are used for training; the performance of both methods degrades
gracefully.

To provide some more insight into these methods, we note that in the SVM-based
methods, only 27.78% (20 out of 72) of the input vectors serve as support
vectors. For SNoW, out of 262,144 potential features in the pixel-based representation,
only 13,805 were active in the dense case (i.e., 36 views). This shows
the advantage gained from using the sparse architecture. However, only a small
number of those may be relevant to the representation of each target, as a more
careful look at the SNoW output hypothesis reveals.

An additional potential advantage of the SNoW architecture is that it does 
not learn discriminators, but rather can learn a representation for each object, 
which can then be used for prediction in the one-against-all scheme or to build 
hierarchical representations. However, as is shown in Table 3, this implies a
significant degradation in performance. Finding a way to make better predictions
in the one-against-all scheme is one of the important issues for future
investigation, to better exploit the advantages of this approach. 

Table 3. Recognition rates of SNoW using two learning paradigms

                   # of views/object
SNoW               36        18        8         4
one-against-one    95.81%    92.31%    85.13%    81.46%
one-against-all    90.52%    84.50%    81.85%    76.00%



4.4 Empirical Results Using Edge-Based Representation 

For each 32 × 32 edge map, we extract horizontal and vertical edges (of length at
least 3 pixels) and then encode as our features conjunctions of two of these edges.
The number of potential features of this sort is $\binom{2048}{2} = 2{,}096{,}128$. However,
only an average of 1,822 of these are active for objects in the COIL-100 dataset.
To reduce the computational cost the feature vectors were further pruned and 
only the 512 most frequently occurring features were retained in each image. 
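
The count of potential conjunction features can be checked with a short calculation. The sketch assumes that the number of possible edges is 32 × 32 × 2 = 2048 (one horizontal and one vertical edge per pixel), which is inferred from the stated total rather than given explicitly in the text.

```python
import math


def potential_conjunctions(width=32, height=32, directions=2):
    """Number of unordered pairs of distinct edges on a width x height grid."""
    n_edges = width * height * directions   # up to two edges per pixel
    return math.comb(n_edges, 2)


print(potential_conjunctions())
```

Compared with this total, an average of 1,822 active features per image (further pruned to 512) means that each example touches well under 0.1% of the feature space.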

Table 4 shows the performance of the SNoW-based method when conjunctions
of edges are used to represent objects. As before, we vary the number of
views of an object (n) during training and use the rest of the views (72 - n)
of an object for testing. The results indicate that conjunctions of edges provide
useful information for object recognition and that SNoW is able to learn very
good object representations using these features. The experimental results also
show that the relative advantage of this representation increases when the number
of views per object is limited.



Table 4. Experimental results of SNoW classifier on the COIL-100 dataset using
conjunction of edges

                                 # of views/object
SNoW                             36            18            8             4
                                 3600 tests    5400 tests    6400 tests    6800 tests
w/ conjunction of edges          96.25%        94.13%        89.23%        88.28%
w/ primitive intensity values    95.81%        92.31%        85.13%        81.46%


5 Discussion and Conclusion 

We have described a novel view-based learning method for the recognition of 
3D objects using SNoW. Empirical results show that the SNoW-based method 
outperforms other methods in terms of recognition rates except for the dense case 
(36 views). Furthermore, the computational cost of training SNoW is smaller. 

Beyond the experimental study and developing a better understanding of
how and when to compare experimental approaches, the main contribution of
this work is in presenting a way to apply the SNoW learning architecture to 
visual learning. 

Unlike previous general purpose learning methods like SVMs, SNoW learns 
representations for objects, which can then be used as input to other, more in- 
volved, visual processes in a hierarchical fashion. An aspect of the recognition 
problem that we have not addressed here is the ability to use the representa- 
tion to quickly prune away objects that cannot be valid targets for a given test 
image so that rapid recognition from among a small set of reasonable candidates 
is performed. We believe that the SNoW architecture, by virtue of learning a 
positive definition for each object rather than a discriminator between any two, 
is more suitable for this problem, and this is one of the future directions we pursue.
For a fair comparison among different methods, this paper uses the pixel-based
representation in the experiments. However, we view the edge-based representation
that was found to be even more effective and robust as another starting
point for future research. We believe that pursuing the direction of using complex
intermediate representations will benefit future work on recognition and,
in particular, robust recognition under various types of noise. We pursue this
notion also as part of an attempt to provide a learning theory account for the
object recognition problem using the PAC (Probably Approximately Correct)
framework.

Abstract. This paper investigates whether regions of uniform surface 
topography can be extracted from intensity images using shape-from- 
shading and subsequently used for the purposes of 3D object recognition. 
We draw on the constant shape-index maximal patch representation of 
Dorai and Jain. We commence by showing that the resulting shape-index 
regions are stable under different viewing angles. Based on this observa- 
tion we investigate the effectiveness of various structural representations 
and region attributes for 3D object recognition. We show that region 
curvedness and a string ordering of the regions according to size pro- 
vides recognition accuracy of about 96%. By polling various recognition 
schemes, including a graph matching method, we show that a recognition 
rate of 98-99% is achievable. 



1 Introduction 

Shape-from-shading is concerned with recovering surface orientation from local
variations in measured brightness. It was identified by Marr as providing one
of the key routes to understanding 3D surface structure via the 2½D sketch.
Moreover, there is strong psychophysical evidence for its role in surface
perception and recognition. However, despite considerable
effort over the past two decades reliable shape-from-shading has proved an elusive
goal. The reasons for this are two-fold. Firstly, the recovery of surface
orientation from the image irradiance equation is an under-constrained process 
which requires the provision of boundary conditions and constraints on surface 
smoothness to be rendered tractable. Secondly, real-world imagery rarely satisfies 
the constraints needed to render shape-from-shading tractable. As a consequence 
shape-from-shading has suffered from the dual problems of model dominance and 
poor data-closeness. By weakening the data-closeness of the image-irradiance 
equation in favour of smoothness, the recovery of surface detail is sacrificed. This 
in turn has compromised the ability to abstract useful object representations 
from shape-from-shading. As a result there has been little progress in the use of 
shape-from-shading for 3D object recognition from intensity images. 

Recently, we have developed a new shape-from-shading scheme which has
gone some way to overcoming some of the shortcomings of existing algorithms.
Specifically, we have shown how to restore data-closeness by providing a
simple geometric construction which allows the image irradiance equation to be
satisfied as a hard constraint. Secondly, we have developed new curvature
consistency constraints which allow meaningful topographic surface structure to
be recovered.

Having established that we can extract more reliable topographic information 
from the shaded images of surfaces, the aim in this paper is to explore whether 
this information can be used for the purposes of object recognition. Our starting 
point in this paper is the observation that measures of surface topography have
been used to successfully recognise 3D objects from range images. Conventionally,
the approach is to match attributed relational graphs representing
the adjacency structure of H-K patches (patches of uniform mean and Gaussian
curvature). However, more recently Dorai and Jain have shown how both
image histograms and region segmentations generated from the Koenderink and
van Doorn shape index can be used for recognising range images from large
libraries.

Our aim in this paper is to investigate whether region segmentations extracted
from the shape-index delivered by SFS can be used for the purposes of
recognition. In a recent paper we have shown that histograms of shape-index
and other curvature attributes can be used for the purposes of recognition.
However, here we are interested in whether the needle-maps recovered by our new
shape-from-shading scheme are suitable for surface abstraction and structural
object recognition. We therefore follow Dorai and Jain and extract “constant
shape-index maximal patches” from the needle-maps. These correspond
to surface regions of uniform topographic class. We investigate several struc- 
tural abstractions of the resulting topographic regions. The simplest of these 
is a string which encodes the ordering of the region sizes. The second is the 
region adjacency graph. We also investigate several region attributes including 
shape-index, curvedness and area. The most effective structure is a string of re- 
gion curvedness values which gives a recognition rate of 96%. Finally, by polling 
the various representations, we show that a recognition performance of 98% is 
possible. 



2 Shape-from-Shading 

Our new shape-from-shading algorithm has been demonstrated to deliver needle-maps
which preserve fine surface detail. The observation underpinning
the method is that for Lambertian reflectance from a matte surface, the image
irradiance equation defines a cone of possible surface normal directions. The
axis of this cone points in the light-source direction and the opening angle is
determined by the measured brightness. If the recovered needle-map is to satisfy
the image irradiance equation as a hard constraint, then the surface normals must
each fall on their respective reflectance cones. Initially, the surface normals are
positioned so that their projections onto the image plane point in the direction
of the image gradient. Subsequently, there is iterative adjustment of the surface
normal directions so as to improve the consistency of the needle-map. In other
words, each surface normal is free to rotate about its reflectance cone in such a
way as to improve its consistency with its neighbours. This rotation is a two-step
process. First, we apply a smoothing process to the current surface normal
estimates. This may be done in a number of ways. The simplest is local averaging.
More sophisticated alternatives include robust smoothing with outlier rejection, and
smoothing with curvature or image gradient consistency constraints. This results
in an off-cone direction for the surface normal. Second, the hard data-closeness constraint
of the image irradiance equation is restored by projecting the smoothed off-cone
surface normal back onto the nearest position on the reflectance cone.

Fig. 1. Curvature consistency SFS applied to David, by Michelangelo.

To be more formal, let $s$ be a unit vector in the light source direction and let
$E(i,j)$ be the brightness at the image location $(i,j)$. Further, suppose that $n^k_{i,j}$ is
the corresponding estimate of the surface normal at iteration $k$ of the algorithm.
The image irradiance equation is $E(i,j) = n^k_{i,j} \cdot s$. As a result, the reflectance
cone has opening angle $\cos^{-1} E(i,j)$. After local smoothing, the off-cone surface
normal is $\bar{n}^k_{i,j}$. The updated on-cone surface normal which satisfies the image
irradiance equation as a hard constraint is obtained via the rotation
$n^{k+1}_{i,j} = \Phi \bar{n}^k_{i,j}$. The matrix $\Phi$ rotates the smoothed off-cone surface normal estimate by
the angle difference between the apex angle of the cone, and the angle subtended
between the off-cone normal and the light source direction. This angle is equal
to

$$\theta = \cos^{-1} E(i,j) - \cos^{-1}\left( \frac{\bar{n}^k_{i,j} \cdot s}{\|\bar{n}^k_{i,j}\| \, \|s\|} \right) \qquad (1)$$

This rotation takes place about the axis whose direction is given by the vector
$(u, v, w)^T = \bar{n}^k_{i,j} \times s$, normalised to unit length. This rotation axis is perpendicular to both the light source
direction and the off-cone normal. Hence, the rotation matrix is

$$\Phi = \begin{pmatrix} c + u^2 c' & -ws + uvc' & vs + uwc' \\ ws + uvc' & c + v^2 c' & -us + vwc' \\ -vs + uwc' & us + vwc' & c + w^2 c' \end{pmatrix}$$

where $c = \cos\theta$, $c' = 1 - c$, and, within the matrix only, $s = \sin\theta$.
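The projection step can be illustrated with a short sketch. This is not the authors' implementation: the function name `project_onto_cone` and the helper functions are hypothetical, and the rotation angle is taken here as the angle between the smoothed normal and the light source minus the cone's opening angle, which is the sign that lands the normal on the cone for a right-handed rotation about $\bar{n} \times s$. Depending on the handedness convention adopted, this may be the negative of the angle written in Equation (1).

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _unit(v):
    length = math.sqrt(_dot(v, v))
    return tuple(x / length for x in v)

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def project_onto_cone(n_bar, light, brightness):
    """Rotate the smoothed normal n_bar so that (n . light) equals brightness.

    Hypothetical helper: assumes Lambertian reflectance, unit albedo and a
    brightness value already normalised to [0, 1].
    """
    n_bar, light = _unit(n_bar), _unit(light)
    axis = _cross(n_bar, light)
    if _dot(axis, axis) < 1e-24:
        # Assumption: a normal parallel to the light direction has no unique
        # rotation axis, so it is returned unchanged.
        return n_bar
    u, v, w = _unit(axis)
    alpha = math.acos(max(-1.0, min(1.0, _dot(n_bar, light))))  # current angle to light
    beta = math.acos(max(-1.0, min(1.0, brightness)))           # opening angle of the cone
    theta = alpha - beta  # right-handed rotation about n_bar x light moves n_bar towards light
    c, s = math.cos(theta), math.sin(theta)
    c1 = 1.0 - c
    rotation = (
        (c + u * u * c1, -w * s + u * v * c1, v * s + u * w * c1),
        (w * s + u * v * c1, c + v * v * c1, -u * s + v * w * c1),
        (-v * s + u * w * c1, u * s + v * w * c1, c + w * w * c1),
    )
    return tuple(_dot(row, n_bar) for row in rotation)
```

For example, a smoothed normal at 45 degrees to the light source and a brightness of cos 30° should be rotated 15 degrees towards the light, so that its dot product with the light direction equals the measured brightness.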

The off-cone surface normal is recovered through a process of robust smoothing.
The smoothness error or consistency of the field of surface normals is
measured using the derivatives of the needle-map in the x and y directions by
the penalty function

$$I = \iint \left[ \rho_\sigma\left( \left\| \frac{\partial n}{\partial x} \right\| \right) + \rho_\sigma\left( \left\| \frac{\partial n}{\partial y} \right\| \right) \right] dx \, dy \qquad (2)$$

In the above measure, $\rho_\sigma(\eta)$ is the robust error kernel used to gauge the local
consistency of the needle-map or field of surface normals. The argument of the
kernel, $\eta$, is the measured error and the parameter $\sigma$ controls the width of the
kernel. It is important to note that the robust error kernels are applied separately to
the magnitudes of the derivatives of the needle-map in the x and y directions.
Applying variational calculus yields the smoothed surface normal that minimises
the penalty function.

Stated in this way the smoothing process is entirely general. However, in our
previous work we found that the most effective error kernel was the log-cosh
sigmoidal-derivative M-estimator

$$\rho_\sigma(\eta) = \frac{\sigma}{\pi} \log \cosh\left( \frac{\pi \eta}{\sigma} \right) \qquad (4)$$
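As a numerical illustration, the sketch below evaluates a log-cosh kernel and its derivative, assuming the kernel takes the form $\rho_\sigma(\eta) = (\sigma/\pi) \log\cosh(\pi\eta/\sigma)$. The function names are hypothetical. The derivative is a hyperbolic tangent, which is what makes the estimator "sigmoidal-derivative": small errors are penalised roughly quadratically, large errors roughly linearly.

```python
import math

def log_cosh_kernel(eta, sigma):
    """Robust error kernel (sigma / pi) * log(cosh(pi * eta / sigma)).

    Uses log(cosh(x)) = |x| + log(1 + exp(-2|x|)) - log(2) so that large
    errors do not overflow cosh().
    """
    x = math.pi * abs(eta) / sigma
    return (sigma / math.pi) * (x + math.log1p(math.exp(-2.0 * x)) - math.log(2.0))

def log_cosh_influence(eta, sigma):
    """Derivative of the kernel with respect to eta: a sigmoid bounded by +/-1."""
    return math.tanh(math.pi * eta / sigma)
```

Because the influence function saturates, a single large discontinuity in the needle-map contributes a bounded pull on the smoothed estimate, whereas a quadratic penalty would let it dominate.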



Examples of the needle-maps and the detected surface ridge structures delivered
by our new shape-from-shading method are shown in Figure 1.



3 Curvature

The aim in this paper is to explore the use of various curvature attributes for
needle-map segmentation and subsequent region-based recognition. The aim in
segmenting the needle-map is to find regions of uniform topographic class on the
surface of the viewed object. One of the original ways of segmenting surface data
according to topography was to use mean and Gaussian curvature labels. The
eight topographic labels are assigned on the basis of the signs and zeros of the
two curvatures. However, the topographic labelling is a somewhat cumbersome
process since it requires the setting of four curvature thresholds. The shape index
of Koenderink and van Doorn provides a more elegant characterisation of
surface topography. This is a single continuous measure which encodes the same
topographic class information as H-K labels in an angular representation,
without the need to set thresholds.

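The differential structure of a surface is captured by the Hessian matrix,
which may be written in terms of surface normals as

$$\mathcal{H} = \begin{pmatrix} \left( \frac{\partial n}{\partial x} \right)_x & \left( \frac{\partial n}{\partial x} \right)_y \\ \left( \frac{\partial n}{\partial y} \right)_x & \left( \frac{\partial n}{\partial y} \right)_y \end{pmatrix} \qquad (5)$$

where $(\cdots)_x$ and $(\cdots)_y$ denote the x and y components of the parenthesized
vector respectively.

The eigenvalues of the Hessian matrix, found by solving the equation $|\mathcal{H} - \kappa I| = 0$,
are the principal curvatures of the surface, denoted $\kappa_{1,2}$. The mean curvature
is related to the trace of the Hessian matrix and is given by $H = \frac{1}{2}(\kappa_1 + \kappa_2)$.
The Gaussian curvature is equal to the determinant of the Hessian and is given
by $K = \kappa_1 \kappa_2$.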

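The shape index is defined in terms of the principal curvatures using the
angular measure

$$\phi = \frac{2}{\pi} \arctan \frac{\kappa_2 + \kappa_1}{\kappa_2 - \kappa_1}, \qquad \kappa_1 \geq \kappa_2 \qquad (6)$$

and the overall magnitude of curvature is measured by the curvedness
$c = \sqrt{(\kappa_1^2 + \kappa_2^2)/2}$.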
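The curvature quantities above can be computed per pixel as in the following sketch. It assumes the shape index is $(2/\pi)\arctan[(\kappa_2 + \kappa_1)/(\kappa_2 - \kappa_1)]$ with $\kappa_1 \geq \kappa_2$, and that curvedness is the root mean square of the principal curvatures (the standard Koenderink and van Doorn definition). The function names are hypothetical, and the Hessian is symmetrised before its eigenvalues are taken, which is an assumption made here to cope with noisy needle-map derivatives.

```python
import math

def principal_curvatures(h_xx, h_xy, h_yx, h_yy):
    """Eigenvalues (k1 >= k2) of a 2x2 Hessian, symmetrised first."""
    off_diagonal = 0.5 * (h_xy + h_yx)
    mean = 0.5 * (h_xx + h_yy)
    radius = math.hypot(0.5 * (h_xx - h_yy), off_diagonal)
    return mean + radius, mean - radius

def shape_index(k1, k2):
    """Shape index in [-1, 1]; None for planar points where it is undefined."""
    if k1 < k2:
        k1, k2 = k2, k1
    if k1 == 0.0 and k2 == 0.0:
        return None
    # atan2 handles the umbilic case k1 == k2 without dividing by zero;
    # (k2 + k1) / (k2 - k1) equals -(k1 + k2) / (k1 - k2).
    return (2.0 / math.pi) * math.atan2(-(k1 + k2), k1 - k2)

def curvedness(k1, k2):
    """Overall magnitude of curvature, sqrt((k1^2 + k2^2) / 2)."""
    return math.sqrt(0.5 * (k1 * k1 + k2 * k2))
```

With this sign convention, a point with both principal curvatures negative and equal maps to a shape index of 1, and a symmetric saddle with equal and opposite curvatures maps to 0.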

The relationship between the shape-index, the mean and Gaussian curvatures,
and the topographic class of the underlying surface is summarised in Table
1. The table lists the topographic classes (i.e. dome, ridge, saddle ridge etc.) and
the corresponding shape-index interval. We assign the topographic class $\omega_p$ to
the pixel indexed $p$ provided that the measured shape-index value $\phi_p$ falls within
the relevant shape-index interval.

Table 1. Topographic classes.

Class ($\omega_i$)   Symbol   H   K   Region-type   Shape index interval ($\tau_i$)
Dome             D        -   +   Elliptic      [5/8, 1]
Ridge            R        -   0   Parabolic     [3/8, 5/8)
Saddle ridge     SR       -   -   Hyperbolic    [1/8, 3/8)
Plane            P        0   0   Hyperbolic    Undefined
Saddle-point     S        0   -   Hyperbolic    [-1/8, 1/8)
Saddle rut       C        +   -   Hyperbolic    [-3/8, -1/8)
Rut              V        +   0   Parabolic     [-5/8, -3/8)
Cup              SV       +   +   Elliptic      [-1, -5/8)

4 A Region-Based Structural Representation

Our aim in this paper is to explore how the topographic labelling delivered
by shape-from-shading can be used to recognise 3D objects from 2D images.
The aim is to explore how the structural arrangement of the regions of uniform
topographic class, together with region attributes derived from the needle-map,
can be used for the purposes of recognition and matching. In this section we
describe how the region structures and the attributes can be extracted from the
needle-map and shape-index information delivered by shape-from-shading.



4.1 Constant Shape Index Maximal Patches 

Our region-based representation borrows some of the features of the COSMOS
representation which we consider most likely to be stable when recovered using
SFS. Specifically, we use the needle-map and topographic labels in tandem to
generate a rich description of object topography.

The representation describes an image of an object using a patchwork of
maximal regions of constant shape-index. These constant shape maximal patches
(CSMP) are defined on the region of an image, $O$, corresponding to the object.
Suppose that each pixel in the image is assigned a topographic label on the basis of
its measured shape-index value and the shape-index intervals for the different
topographic classes defined in Table 1. Let $\omega_p$ and $\omega_q$ be the topographic labels
assigned to two pixels with pixel positions $p$ and $q$. Further let $\Gamma(p, q)$ be a path
between these two pixels. A CSMP is a maximally-sized image patch $P \subseteq O$,
such that $\forall p, \forall q \in P$, $\omega_p = \omega_q$ and there exists a connected path $\Gamma(p, q)$ from
$p$ to $q$ consisting of points $r \in P$ such that $\omega_p = \omega_q = \omega_r$. The path condition
imposes connectedness of the CSMP, defining it as a contiguous image region
of constant shape index. For example, the image of a sphere should, if ideally
labeled by our SFS scheme, possess a single CSMP of spherical cap shape index.

Since we are working with noisy data derived from single images, rather
than the CAD data and range images investigated by Dorai and Jain, our
regions tend to be relatively small and fragmented (e.g. Figure 2). To obtain a
manageable list of regions, we impose a minimum region size of 25 pixels. Since
the images used are 128x128, this corresponds to a limiting size of 0.15% of the
total image area. This typically gives us between 40 and 80 regions per image.
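The labelling and patch extraction described above can be sketched as a connected-component search. This is an illustration rather than the authors' code: the interval boundaries are our reading of Table 1, 4-connectivity is an assumption (the text only requires a connected path), and the names `topographic_label` and `extract_csmps` are hypothetical.

```python
from collections import deque

# (lower bound, class name) pairs in descending order; assumed reading of Table 1.
INTERVALS = [
    (5 / 8, "dome"), (3 / 8, "ridge"), (1 / 8, "saddle ridge"),
    (-1 / 8, "saddle point"), (-3 / 8, "saddle rut"),
    (-5 / 8, "rut"), (-1.0, "cup"),
]

def topographic_label(phi):
    """Map a shape-index value to a class name; None (planar/background) stays None."""
    if phi is None:
        return None
    for lower, name in INTERVALS:
        if phi >= lower:
            return name
    raise ValueError("shape index must lie in [-1, 1]")

def extract_csmps(label_map, min_size=25):
    """Return (label, pixels) patches of constant label, largest first.

    label_map is a 2D list of class names, with None marking pixels that
    carry no label. Patches smaller than min_size pixels are discarded.
    """
    rows, cols = len(label_map), len(label_map[0])
    visited = [[False] * cols for _ in range(rows)]
    patches = []
    for r0 in range(rows):
        for c0 in range(cols):
            label = label_map[r0][c0]
            if visited[r0][c0] or label is None:
                continue
            visited[r0][c0] = True
            queue, pixels = deque([(r0, c0)]), []
            while queue:  # breadth-first flood fill over same-label neighbours
                r, c = queue.popleft()
                pixels.append((r, c))
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= rr < rows and 0 <= cc < cols
                            and not visited[rr][cc]
                            and label_map[rr][cc] == label):
                        visited[rr][cc] = True
                        queue.append((rr, cc))
            if len(pixels) >= min_size:
                patches.append((label, pixels))
    patches.sort(key=lambda patch: len(patch[1]), reverse=True)
    return patches
```

On a 128x128 image the default `min_size=25` corresponds to the 0.15% area threshold quoted in the text; sorting by size gives the ordering used by the sequence-based representations.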

To demonstrate the stability of the representation, Figure 2 shows the
CSMPs for a sequence of different views of the toy duck from the Columbia
University COIL data-base. Notice how the valley lines around the beak and
the wing are well recovered at each viewing angle, and also how the shape of
the saddle structure below the wing is maintained. In Figure 3 we show more
examples of labeling stability using several other object-views drawn from the
COIL database.

4.2 Region Attributes 

Having segmented the image into approximate CSMPs we next consider various
region attributes which can be used to augment the representation for the
purposes of recognition. The first, and most obvious, measures are the average
shape-index of the patch itself, and its size. In practice, we normalize the sizes of
the regions forming the image to sum to unity, thus providing a degree of scale
invariance. These two measures suggest a very simple image representation which
we can use for the purposes of demonstrating the stability of the CSMPs. We
sort the CSMPs according to region size and display their associated topographic
label. In fact this is not only a simple way of visualising the stability of the
image segmentation, it also has potential as a simple, compact representation
in its own right. Figure 4 shows the CSMP region sizes and topographic labels
for the first 35 segmentations of the toy duck. In Figure 5 we show the histograms of
topographic label frequencies for the 20 objects from the COIL data-base. These
histograms are remarkably similar. It is clear that, whilst the overall shape-index
histograms may be similar for different objects (Figure 5), the image structures
differ appreciably and systematically. Moreover, we see considerable correlation
between the region-size/shape-index pairings of different views of a given object.

Besides the CSMP region-sizes and topographic labels, we can add other
attributes to the representation. For example, in the COSMOS representation,
Dorai and Jain incorporate the mean normal for each region $P$, calculated as
$\bar{n} = \frac{1}{|P|} \sum_{p \in P} n_p$. We also include the mean curvedness of each CSMP as part of
the representation.
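The attributes named in this subsection can be gathered per region as in the sketch below. The dictionary-based data layout and the function name `region_attributes` are assumptions made for illustration; areas are normalised over the regions retained after segmentation so that they sum to unity.

```python
def region_attributes(patches, normals, curvedness):
    """Compute normalised area, mean normal and mean curvedness per region.

    patches    : list of pixel lists, each pixel a (row, col) tuple
    normals    : dict mapping (row, col) to a 3-tuple surface normal
    curvedness : dict mapping (row, col) to a curvedness value
    """
    total = sum(len(pixels) for pixels in patches)
    attributes = []
    for pixels in patches:
        count = len(pixels)
        # Component-wise mean of the needle-map over the region.
        mean_normal = tuple(
            sum(normals[p][axis] for p in pixels) / count for axis in range(3)
        )
        mean_curvedness = sum(curvedness[p] for p in pixels) / count
        attributes.append({
            "area": count / total,  # region sizes sum to one
            "mean_normal": mean_normal,
            "curvedness": mean_curvedness,
        })
    return attributes
```

Note that the mean normal is not renormalised to unit length here; for an extended region spanning many orientations its length shrinks, which is one way to see why the mean normal becomes less representative as regions grow.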

For small, compact regions, the variance in the normal directions is likely
to be small. For extended regions, however, the mean normal may not prove
representative of the potentially wide range of normal directions present within
the region. This is a particular problem, since for a compact region it is possible
to use the normal to adjust the region size, in order to approximate the true area
of the region on the object, i.e. to compensate for the foreshortening of visible
surfaces due to their orientation to the viewer. This has the potential to improve
the viewpoint invariance of the representation, but unfortunately is not feasible
when extended CSMPs may span a large range of normal directions.



4.3 Region Adjacency Graph 

Potentially the most important element of our representation is the region-adjacency
graph (RAG), since this encodes much of the structural information
about the arrangement of the topographic structures that constitute objects in
the image. Recovery of the RAG from a region-based description of an image
is relatively straightforward. We opt to traverse the list of CSMPs and find all
other regions possessing a thresholded number of shared border pixels. We find
that a minimum of 5 pixels yields a detailed but manageable RAG, typically
with around 10-20 adjacent regions adjoining the largest CSMPs, reducing in a
well-behaved fashion to 1 or 2 adjacencies for the smallest regions, although some
small regions tend to be isolated by this criterion. An example of the regions
extracted from the toy-duck image is shown in Figure 6.

Fig. 2. As the duck rotates by 5° intervals (left to right and top to bottom), the topographic
labeling remains consistent. The labels are coloured according to the scheme
proposed by Koenderink and van Doorn. The label colours are: green=spherical
cup, cyan=trough, blue=cylindrical ruts, pale blue=saddle ruts, white=symmetrical
saddles, background and planar regions, pale yellow=saddle ridge, yellow=cylindrical
ridge, orange=dome and red=spherical cap.
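A region adjacency graph with a shared-border threshold can be built as sketched below. The text does not say exactly how shared border pixels are counted, so this sketch assumes a count of 4-neighbour pixel pairs that straddle two regions; the function name and the integer region map are likewise hypothetical.

```python
from collections import Counter

def region_adjacency_graph(region_map, min_shared=5):
    """Return the set of edges (a, b), a < b, between regions sharing a border.

    region_map is a 2D list of integer region labels, with None for pixels
    that belong to no retained region. An edge is kept when at least
    min_shared neighbouring pixel pairs straddle the two regions.
    """
    rows, cols = len(region_map), len(region_map[0])
    shared = Counter()
    for r in range(rows):
        for c in range(cols):
            a = region_map[r][c]
            if a is None:
                continue
            # Looking only right and down visits each neighbouring pair once.
            for rr, cc in ((r, c + 1), (r + 1, c)):
                if rr < rows and cc < cols:
                    b = region_map[rr][cc]
                    if b is not None and b != a:
                        shared[(min(a, b), max(a, b))] += 1
    return {edge for edge, count in shared.items() if count >= min_shared}
```

Raising `min_shared` prunes incidental contacts between regions, at the cost of isolating small regions, which matches the behaviour reported in the text.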



5 Recognition Strategy 

Having derived a structural representation of surface topography delivered by
shape-from-shading, it remains to match these representations in order to achieve
object recognition. We adopt two different approaches to matching the CSMPs
extracted from the raw shape-index delivered by the shape-from-shading scheme.
The first of these is set-based and uses various attributes for the CSMPs. This
first approach does not use any information concerning relational arrangement
or graph-structure. The second approach is graph-based and aims to compare
objects using information conveyed by the edge-structure of the region adjacency
graph of the CSMPs. However, it must be stressed that the aim of our
study is to show that the topographic structure of the needle-maps is useable
for recognition.

Fig. 3. Several objects are shown as they rotate by 10° intervals. The labeling remains
consistent through the rotations. The label colours are: green=spherical cup,
cyan=trough, blue=cylindrical ruts, pale blue=saddle ruts, white=symmetrical saddles,
background and planar regions, pale yellow=saddle ridge, yellow=cylindrical ridge,
orange=dome and red=spherical cap. Clearly, not all objects feature all types of label,
and indeed, spherical cups appear to be particularly rare.

Fig. 4. A simple comparison of image structure in terms of CSMP sizes and shape-index
labels. The 25 largest CSMPs of 35 images of the toy duck are sorted in order
of size. Each vertical bar shows the relative sizes of these regions, coloured according
to their associated shape-index label. There is considerable correlation between neighbouring
bars in terms of both region sizes and shape-index labels.

Fig. 5. Frequency of topographic labels for the 20 objects in the COIL database.



As Dorai and Jain note, there are many possible approaches to matching
on the basis of such a rich representation, and it is not our goal here to compare
or assess, for example, different graph-matching strategies. Indeed, our sole
motivation is to show the utility of SFS for generating useful representations
for recognition. In the subsequent sections, we detail the recognition approach
taken in our experiments, with the important caveat that we make no claims for
the efficacy of the specific matching strategies listed here.

We opt to treat different parts of the representation separately to provide
matching on the basis of each, and subsequently combine the evidence from
different parts of the representation using a majority voting approach in order
to return an overall closest match.

5.1 Attribute-Based Methods 

In this subsection we describe various attribute-based approaches to matching
the sets of CSMPs delivered by shape-from-shading.

Matching using CSMP Size: From Figure 4 there is clear potential for recognition
simply on the basis of the ordered sequences of region sizes. We define
the similarity between sequences as

$$d_A = \sum_{l=2}^{\min(N_M+1,\,N_D+1)} \left| A_l(M) - A_l(D) \right| \qquad (7)$$

where $N_D$ and $N_M$ are the numbers of regions in the data and model representations
respectively, and $A_l$ is the normalized area of the CSMP with region label
$l$. Clearly, discrepancies between the areas of the large regions at the start of the
representation will have greater effect than differences between small regions.
There is scope to compare the strings using edit distance. This would allow us
to deal with the situation where a large region is split, causing the region that
was listed after it to be promoted in the representation. However, as we shall
demonstrate later, even using this simple definition, excellent recognition results
can be obtained on the basis of CSMP-size sequences alone (Figure 7).
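A minimal sketch of the size-sequence comparison follows. It assumes the dissimilarity is the sum of absolute differences between normalised areas at matching ranks, truncated to the shorter sequence; the exact summand used by the authors may differ, and the function name is hypothetical.

```python
def size_sequence_distance(areas_model, areas_data):
    """Rank-wise absolute difference between two sequences of normalised areas.

    Both inputs are sorted into descending order first, so that the largest
    region of one image is compared with the largest region of the other.
    The background is assumed to have been excluded already.
    """
    model = sorted(areas_model, reverse=True)
    data = sorted(areas_data, reverse=True)
    # zip() stops at the shorter sequence, mirroring the min() upper limit.
    return sum(abs(a_m - a_d) for a_m, a_d in zip(model, data))
```

Because areas are normalised to sum to one and sorted by size, the leading terms carry most of the weight, which is why a split in one large region can shift every subsequent comparison. An edit-distance formulation, as the text suggests, would be less sensitive to that effect.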



Matching using Shape-Index and Curvedness: Again considering Figure 4,
we see that the sequence of topographic labels is also characteristic for
different objects. Hence, we define the topographic sequence similarity as

$$d_{ts} = \sum_{l=2}^{\min(N_M+1,\,N_D+1)} \frac{1}{2}\left( A_l(M) + A_l(D) \right) \left| \phi_l(M) - \phi_l(D) \right| \qquad (8)$$

where $\phi_l$ is the shape-index associated with the CSMP with label $l$.
Similarly, we calculate the curvedness sequence similarity as

$$d_{cs} = \sum_{l=2}^{\min(N_M+1,\,N_D+1)} \frac{1}{2}\left( A_l(M) + A_l(D) \right) \left| c_l(M) - c_l(D) \right| \qquad (9)$$

where $c_l$ is the curvedness associated with the CSMP with label $l$.

Including the mean area of the two regions being compared ensures that large
regions have greater effect upon the sequence similarity than small regions. Of
course, if the regions have identical shape-index or curvedness values, they do
not contribute to the distance at all. However, since we use a discrete scale for
the shape-index values but a continuous scale for the curvedness, identical values
for regions are only likely to occur in the case of the shape-index.



Matching using Mean Surface Normal Directions: We match the mean
surface normals of regions using the measure

$$d_{sn} = \sum_{l=2}^{\min(N_M+1,\,N_D+1)} \frac{1}{2}\left( A_l(M) + A_l(D) \right) \left\| \bar{n}_l(M) - \bar{n}_l(D) \right\| \qquad (10)$$
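The area-weighted sequence measures share one pattern, which the sketch below captures in a single function. It assumes each term is the mean of the two region areas multiplied by the magnitude of the attribute difference (absolute value for shape-index and curvedness, Euclidean norm for mean normals). All names are hypothetical, and the sequences are assumed to be sorted by decreasing region size with the background removed.

```python
import math

def weighted_sequence_distance(model, data, difference):
    """Sum of mean-area-weighted attribute differences at matching ranks.

    model, data : lists of (normalised_area, attribute) pairs, largest first
    difference  : function returning a non-negative attribute difference
    """
    return sum(
        0.5 * (area_m + area_d) * difference(attr_m, attr_d)
        for (area_m, attr_m), (area_d, attr_d) in zip(model, data)
    )

def scalar_difference(x, y):
    """Absolute difference, for shape-index or curvedness values."""
    return abs(x - y)

def vector_difference(x, y):
    """Euclidean distance, for mean surface normals."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

Regions with identical attribute values contribute nothing regardless of their size, while the area weighting lets a mismatch between two large regions outweigh many mismatches between small ones.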



6 RAG Comparison 

In this section we turn our attention to the matching of the region adjacency
graph for the CSMPs. This is the most complex part of the representation,
and is therefore the most difficult and expensive part to match. Many graph-matching
methodologies have been reported in the literature, and it
is not our intention here to investigate these. However, most of the reported
methods are tailored to the problem of finding a detailed pattern of correspondences
between pairs of graphs. They are hence not concerned with finding the
graph from a large data-base which most closely resembles the query. Recently,
Huet and Hancock have reported a framework for measuring the similarity
of attributed relational graphs for object recognition from large structural
libraries. The method uses a variant of the Hausdorff distance as an efficiently
computed and simple measure of graph similarity.

The idea underpinning the Hausdorff distance, as used by Rucklidge to
locate objects in images, is to compute the distance between two unordered sets of
observations when the correspondences between the individual items are unknown.
This is done by computing the maximum value of the minimum Euclidean distance
over the space of all pairwise model-data correspondences. However, the
distance proves to be susceptible to noise.

Huet and Hancock’s idea is to use the apparatus of robust statistics to
develop a robust variant of the Hausdorff distance that can be used to measure
the similarity of attributed relational graphs. Suppose that $G^m = (V^m, E^m)$
and $G^d = (V^d, E^d)$ represent a model-graph and a data-graph that are to be
compared, where $V^d$ and $V^m$ are the sets of nodes of the data and model graphs
respectively, and $E^d$ and $E^m$ are the corresponding sets of graph edges or arcs.
The similarity measure compares the attributes on the edges of the two graphs
using a robust error kernel $\rho_\sigma(\cdot)$. The measure is defined to be

$$H^\rho(G^d, G^m) = \sum_{(i,j) \in E^d} \min_{(I,J) \in E^m} \rho_\sigma\left( \left\| v^m_{(I,J)} - v^d_{(i,j)} \right\| \right) \qquad (11)$$

where $v^m_{(I,J)}$ is the vector of measurements associated with the graph edge
$(I, J) \in E^m$ linking node $I \in V^m$ to node $J \in V^m$ in the model graph. Likewise,
$v^d_{(i,j)}$ is the measurement vector corresponding to graph edge $(i, j) \in E^d$.

Suppose that there are several data-graphs which can be matched to the
model and that these graphs have index-set $D$. The best-matching graph has the
minimal Hausdorff distance over the set of stored models and has class identity

$$\theta^m = \arg\min_{d \in D} H^\rho(G^d, G^m) \qquad (12)$$



It remains to define the attribute vectors associated with each edge of the
region adjacency graphs. Once again, a wide range of attributes is possible.
For the sake of simplicity, we choose to assign the normalized region size to
each node. Hence, the attribute vector of each arc in the data graph becomes
$v^d_{(i,j)} = (A_i, A_j)^T$, where $A_i$ and $A_j$ are the normalized region areas represented
by nodes $i$ and $j$ respectively. Similar attribute vectors are defined for the model
graphs, $G^m$.
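The graph comparison can be sketched as follows, assuming the robust Hausdorff measure sums, over the edges of one graph, the smallest kernel-weighted distance to any edge of the other. The log-cosh kernel, the default kernel width, and the names `robust_hausdorff` and `best_match` are assumptions made for illustration rather than details taken from Huet and Hancock's method. Each edge is represented only by its attribute vector of two normalised areas.

```python
import math

def log_cosh_kernel(eta, sigma):
    """Assumed robust kernel: (sigma / pi) * log(cosh(pi * eta / sigma))."""
    x = math.pi * abs(eta) / sigma
    return (sigma / math.pi) * (x + math.log1p(math.exp(-2.0 * x)) - math.log(2.0))

def robust_hausdorff(data_edges, model_edges, sigma=1.0):
    """Sum over data edges of the minimum robust distance to a model edge.

    data_edges, model_edges : non-empty lists of (A_i, A_j) attribute vectors
    """
    total = 0.0
    for v_d in data_edges:
        total += min(
            log_cosh_kernel(math.dist(v_d, v_m), sigma) for v_m in model_edges
        )
    return total

def best_match(query_edges, library, sigma=1.0):
    """Return the key of the library graph with the smallest robust distance."""
    return min(
        library,
        key=lambda name: robust_hausdorff(query_edges, library[name], sigma),
    )
```

No explicit edge correspondences are computed, which keeps the cost at the product of the two edge counts per comparison. Since the edges are undirected, a fuller treatment would also need to decide how the pair $(A_i, A_j)$ is ordered; this sketch takes the vectors as given.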



7 Experiments 

In this section we experiment with the five recognition strategies described in
the previous sections of this paper. We use the COIL data-base from Columbia
University in our experiments. This consists of 72 views of each of 20 objects. The views
are regularly positioned around the equator of each object. In other words the
viewing angle is incremented by 5 degrees between successive views. There are
a variety of objects in the data-base. Since many of the objects have variable
albedo and are not matte, they violate the assumptions concerning Lambertian
reflectance.

Our experimental protocol was as follows. We take each image in turn from
the database. This is taken as a query image and is removed from the data-base.
We then find the closest matching remaining image in the database. The
criterion for the closest match is the smallest distance. If this match is to one
of the two neighbouring views of the query image, then the match is deemed to
have been a success. We measure the recognition performance by the fraction of
queries that return a successful match.
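The leave-one-out protocol can be written down compactly, as in the sketch below. The tuple layout, the function name and the wrap-around treatment of neighbouring views (so that the first and last views of an object count as neighbours) are assumptions; the text does not state how the ends of the view sequence are handled.

```python
def recognition_rate(views, distance, views_per_object):
    """Fraction of queries whose nearest other image is a neighbouring view.

    views    : list of (object_id, view_index, representation) tuples
    distance : function comparing two representations (smaller is closer)
    """
    successes = 0
    for q, (obj, idx, rep) in enumerate(views):
        # The query itself is excluded from the candidates.
        candidates = [k for k in range(len(views)) if k != q]
        best = min(candidates, key=lambda k: distance(rep, views[k][2]))
        match_obj, match_idx, _ = views[best]
        neighbours = {(idx - 1) % views_per_object, (idx + 1) % views_per_object}
        if match_obj == obj and match_idx in neighbours:
            successes += 1
    return successes / len(views)
```

Any of the sequence or graph distances can be passed as `distance`, so the same loop evaluates each recognition scheme in isolation.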

7.1 Individual Recognition Performance 

We commence our study by comparing the performance of the different recognition
schemes in isolation from one another.

The results of our experiments are summarised in Figure 7. This compares
the recognition accuracy for each of the five methods outlined in the previous
sections taken in isolation. The best performance is obtained using the curvedness
sequence (98%) and the region size sequence (97%). The graph-matching method
gives a recognition rate of 84%. The shape-index sequence (60%) and the surface
normals (59%) give rather disappointing results. These latter two performance
figures should be compared with the result obtained if structural information
is altogether ignored. If the images are matched on the basis of the relative
frequencies of the topographic classes alone, then a recognition rate of 72% is
achieved.

The fact that the shape-index sequence performs rather poorly may be attributed
to the fact that it provides little additional information. The reason for
this is that the CSMP segmentation is itself derived from shape-index. Moreover,
the curvedness and the size of the CSMPs are also strongly correlated with
one another. Since curvature is inversely proportional to radius, highly curved
objects are likely to present a small area. Finally it is disappointing that the
graph-matching performs less well than the curvedness sequence. However, it
is important to note that the sequence is itself a relational structure. It is a string,
where the adjacency relation is the size ordering of the CSMPs. The region adjacency
graph, on the other hand, uses spatial adjacency as the predicating
relation. Hence, the results may indicate that feature contrast is
more important than spatial organisation. This is certainly in tune with work in
the psychology literature including that of Tversky.

7.2 Combining Evidence 

Having considered the individual performance of each of our recognition schemes
in turn, we now consider how to combine evidence from different components of
the overall representation. We use a simple majority voting procedure. For each
recognition scheme in turn we record the identities of the ten best matches. These
ten matches each represent a vote that can be cast by a single recognition scheme
for the different objects in the data-base. The different recognition schemes are
then polled by tallying their votes for the different objects in the data-base. The
object that receives the greatest number of votes is the winner. So, the maximum
number of votes that a single object can receive is $10e$, where $e$ is the number
of recognition algorithms being polled. There are clearly many ways of polling
committees of experts and this remains an active topic of research. Suffice to say
that the aim here is to investigate whether polling can improve the recognition
performance significantly.
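The polling procedure amounts to a vote tally, sketched below. The function names are hypothetical; each recognition scheme is assumed to supply a list of object identities ordered from best to worst match, and ties are broken by the order in which objects first received a vote, which is an arbitrary choice not specified in the text.

```python
from collections import Counter

def poll_matches(ranked_lists, top_k=10):
    """Tally one vote per entry in the top_k matches of every scheme.

    ranked_lists : one list of object identities per recognition scheme,
                   ordered from closest to furthest match
    """
    votes = Counter()
    for ranked in ranked_lists:
        votes.update(ranked[:top_k])
    return votes

def winner(votes):
    """Object identity with the most votes."""
    return votes.most_common(1)[0][0]
```

With `top_k=10` and `e` schemes, no object can collect more than `10 * e` votes, since each scheme contributes at most ten.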

Figure 8 shows the recognition results obtained using this simple majority
voting approach to combining evidence. We achieve better than 90% recognition
by considering the first three matches. The results may be slightly improved, as
Figure 9 illustrates, by using only the best two components of the representation
as determined from Figure 7.




Fig. 6. The toy duck image is split into labeled regions. The 6 largest regions have 
had false colour added to clarify the image. These regions are ranked, in descending 
order of size, blue, red, green, yellow, magenta and cyan. They correspond to the first 
6 rows of the representation described in the text. Note that the labeling begins with 
label number 2 (corresponding to the blue region) since label 1 denotes the background 
and is not used in the adjacency calculations. 



8 Conclusions 

We have investigated the feasibility of 3D object recognition from 2D images
using topographic information derived from shape-from-shading. In particular,
we have concentrated on the structural abstraction of this information
using the CSMP representation of Dorai and Jain. This representation
has been demonstrated as an effective tool in the recognition of 3D objects from
range images.

We have investigated the use of various attributes and relational structures
computed from the CSMPs. The most effective of these is a string of curvedness
attributes ordered according to CSMP size. However, we have also obtained
useful results by matching the region adjacency graph for the CSMPs.

Fig. 7. Comparison of the recognition performance obtained using each component of
the structural representation in isolation. The percentage of correct recognitions in the
first 5 matches is shown using (1) the shape-index sequence, (2) curvedness sequence,
(3) region size sequence, (4) mean region normals, and (5) the region adjacency graph.

Fig. 8. Recognition performance in terms of correct matches when taking the 1st n
matches. From left to right, we consider the first match only, the first two matches,
and so on up to the first five matches. The recognition is performed using all five
components of the overall representation, combined using simple majority voting.

Based on these results, there is clearly a great deal of research that can be
undertaken with the aim of improving recognition performance. In particular,
we intend to pursue the use of additional region shape information. However, it
may prove that the most important contribution of SFS-derived representations
has nothing to do with regions. The parabolic lines recovered in the shape-index
labelings have potential, according to psychophysical observations by Koenderink,
as a sparse object representation for object recognition.

Fig. 9. Recognition performance in terms of correct matches when taking the 1st n
matches, using only the best two components of the representation (curvedness and
region size). The overall recognition rate is slightly improved over Figure 8.



Abstract. We present a method for recognition of walking people in
monocular image sequences based on extraction of coordinates of specific
point locations on the body. The method works by comparison of
sequences of recorded coordinates with a library of sequences from different
individuals. The comparison is based on the evaluation of view invariant
and calibration independent view consistency constraints. These
constraints are functions of corresponding image coordinates in two views
and are satisfied whenever the two views are projected from the same
3D object. By evaluating the view consistency constraints for each pair
of frames in a sequence of a walking person and a stored sequence we get
a matrix of consistency values that ideally are zero whenever the pair of
images depict the same 3D-posture. The method is virtually parameter
free and computes a consistency residual between a pair of sequences that
can be used as a distance for clustering and classification. Using interactively
extracted data we present experimental results that are superior
to those of previously published algorithms both in terms of performance
and generality.



Keywords: structure from motion, calibration, object recognition 

1 Introduction 

Visual analysis of human motion is an area with a large potential for applications.
These include medical analysis, automatic user interfaces, content based video
analysis etc. It has therefore received an increased amount of attention during
recent years. A dominant part of the work in the field has been concerned
with automatic tracking using shape models in 2D or 3D, with the main
application of classifying human action in mind. For this, a
3D model would be invaluable, which in general requires multiple views for the
analysis. The important case of using just a single view makes the problem of 3D
model acquisition far more difficult and has so far met with only limited success.
This difficulty of automatic 3D analysis from a single sequence stands in
sharp contrast to the ease with which we can perceptually interpret human motion
and action, even from very impoverished monocular stimuli such as moving light
displays, although the ability to recognize a specific individual from such
displays is far from perfect [2].



D. Vernon (Ed.): ECCV 2000, LNCS 1842, pp. 472-g^ 2000. 
The purpose of the work presented in this paper is the identification of walking
people from a single monocular video sequence acquired from an arbitrary
relative viewpoint. In order for the identification to be insensitive to changes in
viewpoint and in order to overcome the perspective effects that will be present
during a walking cycle, we should ideally make use of 3D properties of human
gait such as time sequences of relative angles of limbs etc., which, as was pointed
out, is a very difficult problem. All work so far in automatic person identification
from image sequences has therefore restricted the motion to frontoparallel
walking relative to the camera [13], [12].

Since 3D model acquisition from a single view is notoriously difficult and 
unstable it will require strong model assumptions about the movement in 3D in 
order to get a stable solution. Making strong assumptions about the movement 
in 3D is however highly undesirable if the application is identification since the 
model assumptions will wipe out the individual differences between the different 
persons that are to be identified. 

In contrast to this we will allow for a very general relative viewpoint by
basing the recognition implicitly on the 3D structure and motion of the walking
person without performing any explicit 3D reconstruction. This will be done
by exploiting view consistency constraints that exist between two views of the
same 3D structure. Given two image point sets, the view consistency constraints
are functions of the coordinates of the points in the two images and they are
satisfied if the two images are the projection of the same 3D point set. They
therefore answer the question: “can these two views be the projection of the
same 3D structure?”

In the following sections we will present the geometric interpretation and
algebraic derivation of view consistency (VC) constraints and demonstrate how
they can be used to verify whether two image sequences depict the same walking person.



2 View Consistency Constraints 

2.1 Geometric Interpretation 

Whenever we have two images of the same point set in 3D, the two image planes
can be positioned so that the lines of sight through the points intersect at the
coordinates of the 3D point set (fig. 1A). We will refer to two such image point
sets as being view consistent.

Note that view consistency follows from the fact that two views are projected
from the same 3D point set but the reverse is not true. Having two view consistent
point sets does not imply that they are projected from the same 3D point set.
Two different 3D point sets can align accidentally to produce two view consistent
image point sets. The equivalence class of ambiguous 3D point sets with this
property can be generated easily by just extending the lines of sight in the
two images of fig. 1B. If however we are viewing a restricted class of 3D shapes
with statistically a priori defined properties, it is likely that these ambiguous
equivalence classes of 3D point sets can be narrowed down substantially. If the
points represent specific locations on the human body, for example, the constraints on
human posture will lead to a restriction of the 3D shape ambiguity, and the degree
to which the view consistency constraint is satisfied will reflect the difference in
3D shape of the point sets giving rise to the two views. The degree to which this
is true must of course eventually be determined by experimental evaluation for
each restricted class of shapes.

Fig. 1. A) Two image point sets are view consistent if they are the projection of
the same 3D point set. B) Two image point sets can be accidentally view consistent
although they are not projected from the same 3D structure.

2.2 Two View Consistency 

The fundamental constraints relating the projection of a 3D point to an image
involve three kinds of parameters: camera parameters, 3D coordinates and image
coordinates. These are related by a projection equation, which varies with the
kind of geometry and degree of calibration assumed for the camera. If more than
one view is available, the 3D coordinates can be eliminated, leaving constraint
relations between camera parameters and image coordinates in the two views.
These are known as epipolar constraints. Alternatively, by having sufficiently
many points in one view, the camera parameters can be eliminated, leaving
constraints in 3D and image coordinates known as single view shape constraints.
The process of elimination can be continued from these constraints.
By using sufficiently many points the camera parameters in the epipolar
constraints can be eliminated, leaving just constraints in the image coordinates of
the two views. Alternatively, by using multiple views, the 3D coordinates can be
eliminated from the single view shape constraints, leaving identical constraints
in the image coordinates. These image coordinate constraints reflect the fact
that the two views are the projection of the same 3D point set and are therefore
the algebraic expressions of the view consistency constraints.¹

¹ Since the initial submission of this paper it has come to the author’s attention that
the specific four point constraint for the case of known scale factors was actually
first derived in [2] and has been discussed for use in recognition applications.

Fig. 2. Derivation of view consistency constraints by elimination involving camera
parameters: F, 3D coordinates: P and image coordinates: q

In app. 1 it is shown that for orthographic projection cameras with unknown
scale factors, four corresponding points with image coordinates $p_i^a$ and $p_i^b$ in the
two views respectively satisfy the polynomial consistency constraint

\[
\alpha \left( [B_2 A_3 A_4] + [A_2 B_3 A_4] + [A_2 A_3 B_4] \right)
- \beta \left( [B_2 B_3 A_4] + [B_2 A_3 B_4] + [A_2 B_3 B_4] \right) = 0
\tag{1}
\]

where the determinants $[A_i A_j B_k]$ are defined as:

\[
[A_i A_j B_k] =
\begin{vmatrix}
\langle a_i a_2 \rangle & \langle a_j a_2 \rangle & \langle b_k b_2 \rangle \\
\langle a_i a_3 \rangle & \langle a_j a_3 \rangle & \langle b_k b_3 \rangle \\
\langle a_i a_4 \rangle & \langle a_j a_4 \rangle & \langle b_k b_4 \rangle
\end{vmatrix}
\tag{2}
\]

with $a_i = p_i^a - p_1^a$ and $b_i = p_i^b - p_1^b$, $i = 2 \ldots 4$, and where $\alpha$ and $\beta$ are the
unknown scale factors squared of the two cameras. By taking a fifth point we can eliminate
the ratio $\alpha / \beta$ to get a view consistency constraint polynomial in five points, as
shown in app. 1.
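
As a rough illustration of eqs. (1) and (2), the two bracketed sums can be evaluated directly from four corresponding image points. The sketch below is not the author's implementation; the names `inner`, `det3`, `bracket` and `constraint_polynomials` are hypothetical, and the first point of each list is assumed to play the role of the reference point $p_1$.

```python
def inner(u, v):
    """Inner product of two 2D image vectors."""
    return u[0] * v[0] + u[1] * v[1]


def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def constraint_polynomials(points_a, points_b):
    """Return (P1, P2) for four corresponding points in views a and b.

    points_a, points_b: four (x, y) tuples each; index 0 is the reference point.
    P1 sums the determinants with one B column, P2 those with two B columns,
    so that eq. (1) reads alpha * P1 - beta * P2 = 0.
    """
    a = [(p[0] - points_a[0][0], p[1] - points_a[0][1]) for p in points_a[1:]]
    b = [(p[0] - points_b[0][0], p[1] - points_b[0][1]) for p in points_b[1:]]

    def column(vecs, i):
        # Column (<v_i v_2>, <v_i v_3>, <v_i v_4>) of eq. (2)
        return [inner(vecs[i], w) for w in vecs]

    def bracket(pattern):
        # pattern such as "BAA" says which view supplies each of the three columns
        cols = [column(a if c == "A" else b, i) for i, c in enumerate(pattern)]
        rows = [[cols[j][r] for j in range(3)] for r in range(3)]
        return det3(rows)

    p1 = bracket("BAA") + bracket("ABA") + bracket("AAB")
    p2 = bracket("BBA") + bracket("BAB") + bracket("ABB")
    return p1, p2
```

For noise free scaled orthographic views of the same four 3D points, the two returned values should be proportional, with the ratio given by the squared scale factors.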

In general, however, we will have more than five points available. It is therefore
more effective to eliminate $\alpha$ and $\beta$ using a regression procedure, based on
four point constraint polynomials of type (1). These polynomials will not all be
independent. For our application we interactively select 8 points on the image of
a walking person, fig. 3.

All combinations of four points give a polynomial constraint according to eq. (1):

\[
\alpha P_1(i,j,k,l) - \beta P_2(i,j,k,l) = 0
\tag{3}
\]



Fig. 3. Interactively selected point locations on human body

It can be shown theoretically and was verified experimentally that choosing
point sets with 3 points collinear, or close to collinear, will result in numerical instability.
In order to avoid as far as possible having triplets of collinear points we choose
points 1, 2, 3 in all combinations and vary the fourth point among the set 4, 5,
6, 7, 8. We then get the set of constraint polynomials:

\[
\alpha P_1(1,2,3,i) - \beta P_2(1,2,3,i) = 0, \qquad i = 4, 5, 6, 7, 8
\tag{4}
\]

The values of $\alpha$ and $\beta$ are found by regression:

\[
\min_{\alpha, \beta} \sum_{i=4}^{8} \left( \alpha P_1(1,2,3,i) - \beta P_2(1,2,3,i) \right)^2
\tag{5}
\]

and the regression residual using the optimal values $\hat{\alpha}$ and $\hat{\beta}$

\[
R(\hat{\alpha}, \hat{\beta}) = \sum_{i=4}^{8} \left( \hat{\alpha} P_1(1,2,3,i) - \hat{\beta} P_2(1,2,3,i) \right)^2
\tag{6}
\]

is used as a measure of the degree of consistency of the two views. This value
is zero if the two views are noise free orthographic projections of the same 3D
point set. If the views are projections of different point sets, this value will in
general deviate from zero. In general we can expect this value to measure the
degree of 3D similarity of two point sets, especially when we are considering a
restricted class of 3D shapes given by human posture.
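
The regression of eqs. (5) and (6) can be sketched as follows. The text does not say how the trivial solution $\alpha = \beta = 0$ is excluded, so this sketch assumes the normalisation $\alpha^2 + \beta^2 = 1$; that constraint, and the name `fit_scale_factors`, are assumptions made only for illustration. Under this assumption the minimiser is the eigenvector of a 2x2 matrix belonging to its smallest eigenvalue.

```python
import math


def fit_scale_factors(p1_values, p2_values):
    """Minimise sum_i (alpha * P1_i - beta * P2_i)^2 with alpha^2 + beta^2 = 1.

    The unit-norm constraint is an assumption, not taken from the text.
    Returns (alpha, beta, residual), where residual is the sum of eq. (6).
    """
    s11 = sum(p * p for p in p1_values)
    s22 = sum(q * q for q in p2_values)
    s12 = sum(p * q for p, q in zip(p1_values, p2_values))
    # The objective is the quadratic form of M = [[s11, -s12], [-s12, s22]];
    # lam is the smallest eigenvalue of M.
    lam = 0.5 * (s11 + s22) - math.hypot(0.5 * (s11 - s22), s12)
    # Two algebraically equivalent eigenvector formulas; keep the better scaled one.
    candidates = [(s12, s11 - lam), (s22 - lam, s12)]
    alpha, beta = max(candidates, key=lambda v: math.hypot(v[0], v[1]))
    norm = math.hypot(alpha, beta)
    if norm == 0.0:
        alpha, beta = 1.0, 0.0  # degenerate case: every direction is optimal
    else:
        alpha, beta = alpha / norm, beta / norm
    residual = sum((alpha * p - beta * q) ** 2
                   for p, q in zip(p1_values, p2_values))
    return alpha, beta, residual
```

With exactly proportional inputs the residual vanishes and `beta / alpha` recovers the proportionality factor; other normalisations would give a differently scaled residual.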

2.3 Sequence Consistency 

The two view consistency constraint residual is the basis for the algorithm for
recognition of walking people. For this we will use sequences of one walking
cycle from different individuals. With normal walking speed this means about
27-29 frames at a 25 Hz frame rate. By using all frames in the sequence we
will get a statistically more robust identification compared to just two views.
The procedure for computing a measure of consistency of two sequences is as
follows:



1. The image coordinates of 8 specific body locations are extracted
interactively from each frame in the two sequences a and b.

2. For each pair of frames, one from sequence a and the other from
sequence b, we compute the regression residual $R(\hat{\alpha}, \hat{\beta})$ of eq. (6). If the
sequences depict the same person, we will get low values for every pair
of frames with the same 3D posture. Depending on the synchronization
and relative walking speed in the two sequences, low values will show up
along a parallel displaced and tilted diagonal line in the view consistency
matrix composed of all the pairwise regression residuals.

3. Regression residuals are averaged along lines in the view consistency
matrix with various starting and stopping pairs, marked in fig. 4. The
minimum average value will be referred to as the sequence consistency
residual.
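
Step 3 of the procedure above can be sketched as a search over straight lines through the matrix of pairwise residuals. The exact start and stop areas are only given graphically, so this sketch assumes that lines start in the first row and stop in the last row, within a hypothetical `margin` of the matrix corners; the function name is likewise hypothetical.

```python
def sequence_consistency(vcr, margin):
    """Minimum mean residual along straight lines through a VCR matrix.

    vcr: list of rows; vcr[i][j] is the residual between frame i of
    sequence a and frame j of sequence b.
    margin: assumed width (in frames) of the start area along the first
    row and of the stop area along the last row.
    """
    n_a, n_b = len(vcr), len(vcr[0])
    best = float("inf")
    for start in range(0, min(margin, n_b - 1) + 1):
        for stop in range(max(n_b - 1 - margin, 0), n_b):
            total = 0.0
            for i in range(n_a):
                # Column reached at row i on the line from (0, start) to (n_a - 1, stop)
                t = i / (n_a - 1) if n_a > 1 else 0.0
                j = int(round(start + t * (stop - start)))
                total += vcr[i][j]
            best = min(best, total / n_a)
    return best
```

A wider margin tolerates larger differences in synchronisation and walking speed at the cost of more candidate lines.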
Fig. 4. Depending on synchronisation and relative walking speed in two sequences a and
b, minimum sequence consistency residual values will appear along a tilted diagonal
line in the VCR-matrix. The minimum average value of the VCRs is sought among the
lines with start and stop positions in the marked areas.



Fig. 5 shows examples of 16 view consistency matrices for four sequences
compared with each other. The first two, Ka-1 and Kb-1, depict person K from
different viewpoints and the second two depict person M, also from different
viewpoints. The first and the last frames of the sequences can be found in the
figure showing the recorded frames. The sampling of the frames in the sequences
was chosen in order to get approximate synchrony of the walking cycles; this is
not critical but helps the visualization of the view consistency matrices. From
fig. 5 we note that
we get low residual values along and close to the diagonal of the concatenated
view consistency matrices. The values along the diagonal are of course zero since
exactly the same sequence with exactly the same frames is compared there.
The width of the low values close to the diagonal varies depending on the speed
of motion during the walking cycle. The narrowest parts can be seen in the
phase of the motion of the left leg which gives the fastest variation of the 3D
posture. We see that we also get low residual values in the matrices depicting
the same person but different walking sequences, (Ka-1, Kb-1) and (Ma-1, Mb-1),
but substantially higher residual values in the matrices where different persons
are compared.

Fig. 5. View consistency residual (VCR) matrices for sequences Ka, Kb, Ma, Mb
compared with each other (rows and columns: Ka-1, Kb-1, Ma-1, Mb-1).
Dark = low residuals, light = high residuals

3 Experimental Results and Conclusions 

3.1 Recording of Walking Sequences 

In order to test whether the sequence consistency measure defined in the previous
section can be used for identification, we recorded sequences of walking people
with varying direction of motion relative to the camera line of sight (fig. 6).

Fig. 6. Recorded segments of one walking cycle in various relative directions

An attempt was made to record walking directions with angles of 30 and 60
degrees relative to the image plane. They are denoted as a and b respectively in
fig. 6. Sequences c and d are at about 80 degrees relative walking direction. Every
segment in fig. 6 represents one walking cycle. Two consecutive walking cycles
were extracted when possible. These are denoted as a1, a2, b1, b2 and c1, c2
respectively. Six different persons, denoted A, T, C, D, K and M, were recorded.
Sequences of one walking cycle were extracted from the recorded material with
an attempt to choose temporal synchronisation for purposes of display. The figure
of recorded frames shows the first and last frames of the sequences a1 and b1 for
four different persons. The choice of synchronisation of the sequences together with the fact
that walking directions were somewhat approximate relative to the attempted directions
means that there is a slight deviation from the ideal geometry of fig. 6. The two
persons A and T were recorded on another occasion in the directions c and d.

For each recorded sequence, 8 points according to fig. 3 were selected
interactively. The view angle of the camera was around 45 deg. and the same figure
shows the actual frames recorded.



3.2 Sequence Consistency Residuals 

For the approximate directions of 30 and 60 degrees relative to the image plane,
denoted a and b respectively, we were able to extract 14 different sequences
of one walking cycle from persons C, D, K and M, and for directions c and d,
6 sequences from persons A and T. They were all compared pairwise and the
sequence consistency residuals according to section 2.3 were computed. These are shown
in the table of fig. 7.



Fig. 7. Sequence consistency residuals for all sequence pairs

The table shows that sequence consistency residuals are in general substantially
lower for sequence pairs of the same person compared to sequence pairs of
different persons. The sequences of a certain person can therefore be visualized
as clusters by the method known as multidimensional scaling, where we try to
plot all sequences as points in a 3D space such that the interpoint distances
equal the sequence consistency residuals. This is in general not possible
to do consistently in an exact way. Fig. 8 shows a plot of this for all walking
segments of all the six persons in the experiments, which is conservative in the
sense that the lowest sequence consistency residuals, those < 10, i.e. all residuals
corresponding to the same person, are reproduced more or less exactly while the
large residuals are in general underestimated in the plot.

From this plot we see that walking sequences from the same person in various
different walking directions can be grouped into non-overlapping clusters. For
reasons of visualization, the dimension of the plot is chosen as 3. We do not really
know the actual intrinsic dimensionality of the data set, so the conservative
3-D plot gives in a sense the worst case projection of the data from some
unknown high dimensional space onto a 3 dimensional space.

3.3 Conclusions 

The performance of any algorithm for classification of data into separate classes
depends of course on the number of classes, in our case the number of different
individuals. It also depends on how many sequences are stored as reference
for each individual. Given a set of training sequences in a library, one can ask
for the probability that a newly recorded sequence is classified correctly. We can
test this on our material using a so called “leave one out” procedure, whereby
one sequence at a time is considered as the sequence to be classified and the
remaining sequences are all considered to be reference sequences. We can then
use a nearest neighbor classifier based on computing the average distance to all
reference sequences from a certain person. If this is done we get the result of the
table in fig. 9.

Fig. 8. 3D multidimensional scaling plot of sequence consistency residuals. The
interpoint distances in 3D correspond roughly to the sequence consistency residuals of fig. 7.

From this table we see that of 20 sequences, 19 are classified correctly using
a simple minimum distance classifier. The only exception is sequence C-b1, which is
classified as T. This is of course a promising result but should be seen in the
light of the fact that we only have 20 sequences from 6 persons. Increasing the
number of persons would of course increase the chance of misclassification.
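
The leave-one-out test described above can be sketched as follows, assuming the pairwise sequence consistency residuals are available as a symmetric matrix. The function name `leave_one_out` and the data layout are hypothetical choices for this illustration.

```python
def leave_one_out(distances, labels):
    """Classify each sequence by the person with the lowest mean residual to it.

    distances: symmetric matrix (list of rows) of sequence consistency residuals.
    labels: person identifier for each sequence, in the same order.
    Returns the list of predicted labels, one per sequence.
    """
    predictions = []
    for i in range(len(labels)):
        per_person = {}
        for j, person in enumerate(labels):
            if j != i:  # the held-out sequence is never its own reference
                per_person.setdefault(person, []).append(distances[i][j])
        means = {p: sum(d) / len(d) for p, d in per_person.items()}
        predictions.append(min(means, key=means.get))
    return predictions
```

The recognition rate is then the fraction of positions where the prediction equals the true label.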

It is interesting to note in comparison that similar procedures for computing
recognition rates led to 81 % in [13] for 26 sequences of 5 persons and around
90 % in [12] for 42 sequences of 6 persons. Both these cases used automatic
data extraction but were restricted to frontoparallel walking, while our
algorithm, using interactive feature extraction, performs over a wide range of
relative walking directions. Using view consistency constraints for identification
of walking people, we therefore can claim superior results and more generality
at the price of using interactive selection of feature points.

Fig. 9. Average sequence consistency residuals over different individuals for all sequences

In its present form we believe that the algorithm can be useful in e.g.
forensic science applications where the problem often is to classify a single
sequence and the library is limited.
4 Appendix

4.1 View Consistency Constraints - Scaled Orthographic Projection 

For most real world cameras we can assume orthographic projection
and square pixels, i.e. the same scale factor in x and y. The only unknown
internal camera parameter is then a scale factor $\sigma$. The projection equation for
an arbitrary orthogonal image coordinate system for this camera can be written
as:

\[
\begin{aligned}
x &= \sigma\, r_1^T P + x_0 \\
y &= \sigma\, r_2^T P + y_0
\end{aligned}
\tag{7}
\]

where $r_1$ and $r_2$ are the first two rows of an arbitrary 3x3 rotation matrix. We
will now derive the view consistency constraints for two cameras of this kind by
elimination of all camera and 3D shape parameters. Note that having two views
does not permit the explicit determination of relative camera rotation but this
will be of no concern for the view consistency constraints since rotation is to be
eliminated anyway.

The unit vectors $r_1$, $r_2$ and $r_3$ of the orthonormal rotation matrix can be
used to expand the vector $P_i - P_1$. Introducing the unknown parameter
$\gamma_i = r_3^T (P_i - P_1)$ and taking differences to eliminate the constants $x_0, y_0$, we get:

\[
\begin{aligned}
\sigma^{-1}(x_i - x_1) &= r_1^T (P_i - P_1) \\
\sigma^{-1}(y_i - y_1) &= r_2^T (P_i - P_1) \\
\gamma_i &= r_3^T (P_i - P_1)
\end{aligned}
\tag{8}
\]

Using the orthonormality of the matrix $(r_1\; r_2\; r_3)$ we get:

\[
P_i - P_1 = \sigma^{-1}(x_i - x_1)\, r_1 + \sigma^{-1}(y_i - y_1)\, r_2 + \gamma_i\, r_3
\tag{9}
\]



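By taking inner products we can eliminate the rotation matrix:

\[
\langle (P_i - P_1)\,(P_j - P_1) \rangle =
\sigma^{-2}(x_i - x_1)(x_j - x_1) + \sigma^{-2}(y_i - y_1)(y_j - y_1) + \gamma_i \gamma_j
\tag{10}
\]

Using image data from two views a and b we can eliminate the 3D coordinates
and get:

\[
\sigma_a^{-2} \left[ (x_i^a - x_1^a)(x_j^a - x_1^a) + (y_i^a - y_1^a)(y_j^a - y_1^a) \right] + \gamma_i^a \gamma_j^a =
\sigma_b^{-2} \left[ (x_i^b - x_1^b)(x_j^b - x_1^b) + (y_i^b - y_1^b)(y_j^b - y_1^b) \right] + \gamma_i^b \gamma_j^b
\]

which, after multiplication by $\sigma_a^2$, we write as:

\[
\rho \left( (x_i^b - x_1^b)(x_j^b - x_1^b) + (y_i^b - y_1^b)(y_j^b - y_1^b) \right)
- \left( (x_i^a - x_1^a)(x_j^a - x_1^a) + (y_i^a - y_1^a)(y_j^a - y_1^a) \right)
= \sigma_a^2 \left( \gamma_i^a \gamma_j^a - \gamma_i^b \gamma_j^b \right)
\tag{11}
\]

where we have used $\rho = (\sigma_a / \sigma_b)^2$.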

In order to get a more compact notation we write the inner products as:

\[
\begin{aligned}
\langle a_i a_j \rangle &= (x_i^a - x_1^a)(x_j^a - x_1^a) + (y_i^a - y_1^a)(y_j^a - y_1^a) \\
\langle b_i b_j \rangle &= (x_i^b - x_1^b)(x_j^b - x_1^b) + (y_i^b - y_1^b)(y_j^b - y_1^b)
\end{aligned}
\tag{12}
\]

\[
\alpha_i = \sigma_a \gamma_i^a, \qquad \beta_i = \sigma_a \gamma_i^b
\tag{13}
\]

Using this we get:

\[
\rho \langle b_i b_j \rangle - \langle a_i a_j \rangle = \alpha_i \alpha_j - \beta_i \beta_j
\tag{14}
\]

where i and j range over 2 ... n, where n is the number of points. By using
sufficiently many points we will now eliminate the unknowns $\rho$, $\alpha_i$ and $\beta_i$ to
get view consistency constraints expressed in terms of the image coordinates
$\langle a_i a_j \rangle$, $\langle b_i b_j \rangle$ of the two views only.

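Consider first the case of having the same scale factor in both views,
$\sigma_a = \sigma_b \Rightarrow \rho = 1$, and take four points. We then have to eliminate six unknown
parameters $\alpha_2, \alpha_3, \alpha_4, \beta_2, \beta_3, \beta_4$. Since three vectors in the plane are
linearly dependent, in general we can write:

\[
\begin{pmatrix} \alpha_4 \\ \beta_4 \end{pmatrix}
= q_1 \begin{pmatrix} \alpha_2 \\ \beta_2 \end{pmatrix}
+ q_2 \begin{pmatrix} \alpha_3 \\ \beta_3 \end{pmatrix}
\tag{15}
\]

Taking inner products of both sides of this equation with the vectors $(\alpha_2, -\beta_2)$,
$(\alpha_3, -\beta_3)$ and $(\alpha_4, -\beta_4)$ we get the linear system of equations:

\[
\begin{aligned}
\alpha_4 \alpha_2 - \beta_4 \beta_2 &= q_1 (\alpha_2 \alpha_2 - \beta_2 \beta_2) + q_2 (\alpha_3 \alpha_2 - \beta_3 \beta_2) \\
\alpha_4 \alpha_3 - \beta_4 \beta_3 &= q_1 (\alpha_2 \alpha_3 - \beta_2 \beta_3) + q_2 (\alpha_3 \alpha_3 - \beta_3 \beta_3) \\
\alpha_4 \alpha_4 - \beta_4 \beta_4 &= q_1 (\alpha_2 \alpha_4 - \beta_2 \beta_4) + q_2 (\alpha_3 \alpha_4 - \beta_3 \beta_4)
\end{aligned}
\tag{16}
\]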
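Substituting the $\alpha$ and $\beta$ expressions using eq. (14) we get:

\[
\begin{aligned}
\langle a_4 a_2 \rangle - \langle b_4 b_2 \rangle &= q_1 (\langle a_2 a_2 \rangle - \langle b_2 b_2 \rangle) + q_2 (\langle a_3 a_2 \rangle - \langle b_3 b_2 \rangle) \\
\langle a_4 a_3 \rangle - \langle b_4 b_3 \rangle &= q_1 (\langle a_2 a_3 \rangle - \langle b_2 b_3 \rangle) + q_2 (\langle a_3 a_3 \rangle - \langle b_3 b_3 \rangle) \\
\langle a_4 a_4 \rangle - \langle b_4 b_4 \rangle &= q_1 (\langle a_2 a_4 \rangle - \langle b_2 b_4 \rangle) + q_2 (\langle a_3 a_4 \rangle - \langle b_3 b_4 \rangle)
\end{aligned}
\]

This system is singular, since the vector $(q_1, q_2, -1)$ is annihilated by the 3x3
matrix of coefficients, which means that the system determinant vanishes:

\[
\begin{vmatrix}
\langle a_2 a_2 \rangle - \langle b_2 b_2 \rangle & \langle a_3 a_2 \rangle - \langle b_3 b_2 \rangle & \langle a_4 a_2 \rangle - \langle b_4 b_2 \rangle \\
\langle a_2 a_3 \rangle - \langle b_2 b_3 \rangle & \langle a_3 a_3 \rangle - \langle b_3 b_3 \rangle & \langle a_4 a_3 \rangle - \langle b_4 b_3 \rangle \\
\langle a_2 a_4 \rangle - \langle b_2 b_4 \rangle & \langle a_3 a_4 \rangle - \langle b_3 b_4 \rangle & \langle a_4 a_4 \rangle - \langle b_4 b_4 \rangle
\end{vmatrix}
= 0
\]

which is the view consistency constraint on four corresponding points in two
views for calibrated orthographic projection cameras.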
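
The calibrated four point constraint can be checked numerically with a few lines of code. This is only a sketch under the stated assumption of equal scale factors; `gram` and `calibrated_residual` are hypothetical names, and index 0 of each point list is taken as the reference point.

```python
def gram(points):
    """Matrix of inner products <v_i v_j> of the vectors p_i - p_1, i = 2..4."""
    ref = points[0]
    vecs = [(x - ref[0], y - ref[1]) for x, y in points[1:]]
    return [[u[0] * v[0] + u[1] * v[1] for v in vecs] for u in vecs]


def calibrated_residual(points_a, points_b):
    """Determinant of the 3x3 matrix with entries <a_i a_j> - <b_i b_j>."""
    ga, gb = gram(points_a), gram(points_b)
    m = [[ga[r][c] - gb[r][c] for c in range(3)] for r in range(3)]
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
```

For two noise free views of the same four points taken with the same scale factor, the returned value should be zero up to rounding error.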

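For the case of arbitrary unknown scale factors in the two views we have for
four points:

\[
\begin{vmatrix}
\langle a_2 a_2 \rangle - \rho \langle b_2 b_2 \rangle & \langle a_3 a_2 \rangle - \rho \langle b_3 b_2 \rangle & \langle a_4 a_2 \rangle - \rho \langle b_4 b_2 \rangle \\
\langle a_2 a_3 \rangle - \rho \langle b_2 b_3 \rangle & \langle a_3 a_3 \rangle - \rho \langle b_3 b_3 \rangle & \langle a_4 a_3 \rangle - \rho \langle b_4 b_3 \rangle \\
\langle a_2 a_4 \rangle - \rho \langle b_2 b_4 \rangle & \langle a_3 a_4 \rangle - \rho \langle b_3 b_4 \rangle & \langle a_4 a_4 \rangle - \rho \langle b_4 b_4 \rangle
\end{vmatrix}
= 0
\]

If this determinant is developed we get a polynomial in $\rho$. Denoting the
determinants:

\[
\begin{vmatrix}
\langle a_i a_2 \rangle & \langle a_j a_2 \rangle & \langle b_k b_2 \rangle \\
\langle a_i a_3 \rangle & \langle a_j a_3 \rangle & \langle b_k b_3 \rangle \\
\langle a_i a_4 \rangle & \langle a_j a_4 \rangle & \langle b_k b_4 \rangle
\end{vmatrix}
= [A_i A_j B_k]
\tag{17}
\]

we get:

\[
[A_2 A_3 A_4]
- \rho \left( [B_2 A_3 A_4] + [A_2 B_3 A_4] + [A_2 A_3 B_4] \right)
+ \rho^2 \left( [B_2 B_3 A_4] + [B_2 A_3 B_4] + [A_2 B_3 B_4] \right)
- \rho^3 [B_2 B_3 B_4] = 0
\tag{18}
\]

However, it is simple to show that:

\[
[A_2 A_3 A_4] = [B_2 B_3 B_4] = 0
\tag{19}
\]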
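
The identity (19) is stated without proof. One way to see it, given here as a sketch that is not taken from the original text, is to note that $[A_2 A_3 A_4]$ is the determinant of a matrix of inner products of three vectors lying in the two dimensional image plane. The symbol $V_a$ below is introduced only for this argument; the same reasoning applies to $[B_2 B_3 B_4]$.

```latex
\[
[A_2 A_3 A_4] = \det\left( V_a^{T} V_a \right), \qquad
V_a = \begin{pmatrix} a_2 & a_3 & a_4 \end{pmatrix} \in \mathbb{R}^{2 \times 3}
\]
\[
\operatorname{rank}\left( V_a^{T} V_a \right) \le \operatorname{rank}(V_a) \le 2 < 3
\quad \Longrightarrow \quad [A_2 A_3 A_4] = 0
\]
```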



The polynomial in $\rho$ therefore reduces to:

\[
[B_2 A_3 A_4] + [A_2 B_3 A_4] + [A_2 A_3 B_4]
- \rho \left( [B_2 B_3 A_4] + [B_2 A_3 B_4] + [A_2 B_3 B_4] \right) = 0
\tag{20}
\]

In order to eliminate the unknown scale factor ratio $\rho$ we need a fifth point.
We then get the system constraint:

\[
\begin{vmatrix}
[B_2 A_3 A_4] + [A_2 B_3 A_4] + [A_2 A_3 B_4] & [B_2 B_3 A_4] + [B_2 A_3 B_4] + [A_2 B_3 B_4] \\
[B_2 A_3 A_5] + [A_2 B_3 A_5] + [A_2 A_3 B_5] & [B_2 B_3 A_5] + [B_2 A_3 B_5] + [A_2 B_3 B_5]
\end{vmatrix}
= 0
\]

which is a view consistency constraint for five points in two views of unknown
scaled orthographic projection cameras.
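
A possible numerical form of the five point constraint is sketched below. It assumes that the determinants involving the fifth point are formed from points 1, 2, 3, 5 in the same way as those for points 1, 2, 3, 4, which the text does not spell out; the function names are hypothetical.

```python
from itertools import product


def _det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def four_point_polynomials(pa, pb):
    """The two bracket sums of eq. (20) for four corresponding points.

    pa, pb: lists of four (x, y) tuples; index 0 is the reference point.
    """
    grams = {}
    for name, pts in (("A", pa), ("B", pb)):
        vecs = [(x - pts[0][0], y - pts[0][1]) for x, y in pts[1:]]
        grams[name] = [[u[0] * v[0] + u[1] * v[1] for v in vecs] for u in vecs]
    sums = {1: 0.0, 2: 0.0}
    for pattern in product("AB", repeat=3):
        n_b = pattern.count("B")
        if n_b in sums:
            # Column j is taken from the view named in pattern[j]
            m = [[grams[pattern[j]][r][j] for j in range(3)] for r in range(3)]
            sums[n_b] += _det3(m)
    return sums[1], sums[2]


def five_point_residual(pa, pb):
    """2x2 determinant that eliminates the scale factor ratio rho."""
    p1_4, p2_4 = four_point_polynomials(pa[:4], pb[:4])
    p1_5, p2_5 = four_point_polynomials(pa[:3] + pa[4:5], pb[:3] + pb[4:5])
    return p1_4 * p2_5 - p2_4 * p1_5
```

For noise free views of the same five points the residual should vanish regardless of the two scale factors.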
Abstract. This paper presents a new technique for the perception and
recognition of activities using statistical descriptions of their spatio-temporal
properties. A set of motion energy receptive fields is designed
in order to sample the power spectrum of a moving texture. Their structure
relates to the spatio-temporal energy models of Adelson and Bergen
where measures of local visual motion information are extracted by comparing
the outputs of a triad of Gabor energy filters. Then the probability
density function required for Bayes rule is estimated for each class
of activity by computing multi-dimensional histograms from the outputs
of the set of receptive fields. The perception of activities is achieved
according to Bayes rule. The result at each instant of time is the map of
the conditional probabilities that each pixel belongs to each one of the
activities of the training set. Since activities are perceived over a short
integration time, a temporal analysis of outputs is done using Hidden
Markov Models.

The approach is validated with experiments in the perception and recognition
of activities of people walking in visual surveillance scenarios.
The presented work is in progress and preliminary results are encouraging,
since recognition is robust to variations in illumination conditions,
to partial occlusions and to changes in texture. It is shown that it
constitutes a powerful early vision tool for human behavior analysis for
smart environments.



1 Introduction 

The use of computer vision for recognition of activities has many potential ap- 
plications in man-machine interaction, inter-personal communication and visual 
surveillance. Considering several classes of body actions, the machine would be 
able to react to some command gestures. Such techniques support applications 
such as video-conferencing, tele-teaching and virtual reality environments, where 
the user is not confined to the desktop but is able to move around freely. The 
aim of the research described in this paper is the characterization for recognition 
of human actions such as gestures, or full body movements. 

Analyzing the motion of deformable objects from image sequences is a challenging
problem for computer vision. Different approaches have been proposed
for this task. With the exception of algorithms whose aim is to determine the 3D
motion of objects, two trends emerge: techniques using 2D geometric models
of the objects, and appearance based methods.

In the case of human activity analysis a central problem results from the fact
that the human body consists of body parts linked to each other. A possible
approach is to model the articulated structure of the human body. For example, in
the system Pfinder [WADP97] the human body is modeled as a connected set of
blobs, each blob having a spatial and color distribution. Another example is the
Cardboard person model of Ju, Black and Yacoob [YB98], where human limbs
are represented as a set of connected planar patches. Some of these techniques
using human body models assume that a two-dimensional reconstruction precedes
the recognition of action. In either case, the methods require algorithms with
relatively high computational costs whose robustness and stability are difficult
to analyze.

An alternative to geometric and kinematic modeling is to employ an image
based description that captures the appearance of the motion. Davis and Bobick
have defined a representation of action in terms of a Motion History Image
(MHI). An MHI is a scalar-valued image where intensity is a function of the recency
of motion. An appearance-based technique is used to match temporal templates,
computing statistical descriptions of the MHI with Hu moments. The Motion History
Image acts as a local low pass temporal filter, and the distribution over space of
the filter output is used for recognition. The MHI of Davis and Bobick is based
on local motion appearance. In this case, appearance of motion is defined as the
temporal memory of motion occurrences, but no grey level information is used.
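
The idea of intensity as a function of recency can be illustrated with a simple update rule: a pixel is set to a maximum value where motion is detected in the current frame and otherwise decays by one per frame. The text above does not give the exact formula, so this rule and the names `update_mhi`, `motion_mask` and `tau` are assumptions for illustration only.

```python
def update_mhi(mhi, motion_mask, tau):
    """One time step of a motion history image (assumed decay of 1 per frame).

    mhi: list of rows of non-negative numbers, the previous history values.
    motion_mask: same shape, truthy where motion was detected in the current frame.
    tau: number of frames for which a motion occurrence is remembered.
    """
    return [[tau if moving else max(0, value - 1)
             for value, moving in zip(mhi_row, mask_row)]
            for mhi_row, mask_row in zip(mhi, motion_mask)]
```

Applying the update once per frame leaves recently moving pixels bright and lets older motion fade towards zero.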

Another alternative is the use of object appearance based on contour analysis
or active contours. Such an approach has been used by Cootes et al. in
[CTL+93], where flexible shape templates are fitted to data according to a
statistical model of grey level information around model points. In more recent
work, Active Appearance Models (A.A.M.) are used, modeling the object shape
and gray level appearance. Baumberg and Hogg use Point Distribution
Models of the shapes of walking pedestrians. The main characteristics of the
body shape deformations are captured by a Principal Component Analysis of
these point sets. This approach is robust to occlusion, but it requires a background
segmentation to allow the extraction of the boundary of the pedestrian.

The approach described in this paper is related to the MHI of Davis and
Bobick, as a local appearance based method. It is influenced by the work of
Murase and Nayar [MN95], where the set of appearances of objects is expressed
as a trajectory in a principal component space. It is also inspired by Black and
Jepson [BJ96], who extended a global P.C.A. approach to track articulated objects
in a principal component space. All of these approaches derived a space by
performing principal component analysis (P.C.A.) on an entire image. Such global
approaches are sensitive to partial occlusions as well as to the intensity and
shape of background regions. These problems can be avoided by using methods
based on local appearance [Sch97, CC98]. In such an approach, the appearance
of neighborhoods is described with receptive fields. Schiele [Sch97] and Colin
de Verdiere define an orthonormal space for expressing local appearance.
This space is based on Gaussian derivatives or it is computed via P.C.A. over the
set of windows of all the images of the training data. In this space of receptive
fields an image is modeled as a manifold. Colin de Verdiere achieved recognition
by measuring the distance between the vector of receptive field responses
of an observed window and the surface points from a discrete sampling of the
manifold, whereas Schiele has developed a statistical approach using multidimensional
histograms of the responses of vectors of receptive fields.

A Probabilistic Sensor for the Perception and the Recognition of Activities

In this paper the appearance of human motion is described using the appearance
of small spatio-temporal neighborhoods over a set of sequences, and
a statistical approach is used to achieve recognition of activity patterns. The
next section of this paper deals with the approach which is used for describing
spatio-temporal structures. A synopsis of the local visual motion information is
obtained by signal decomposition onto a set of oriented motion energy receptive
fields. Section 3 provides the description of a probabilistic framework for analyzing
the receptive field responses. Multi-dimensional histograms are computed
to characterize each class of activity. Section 4 shows results from the perception
of human activities in the context of computer assisted visual surveillance.
Since humans are perceived as deformable moving objects, the challenge is to
discriminate different classes of human activities. The outputs of the probabilistic
sensor are maps of the probability that each pixel belongs to each one of the
trained classes of activities. Recognition of activity elements is done by selecting
the best local probabilities. Since the temporal aperture window of description
is relatively small compared to the temporal duration of an activity, Hidden Markov
Models (HMM) are employed to recognize the complete activity. In a sense,
the HMM provides context. That is the purpose of section 5. The last section
presents discussions and perspectives.



2 Describing Spatio-Temporal Structures 

Adelson and Bergen define the appearance space of images for a given
scene as a 7 dimensional local function, whose dimensions are viewing position,
time instant, position in the image, and wavelength. They have given this function
the name “plenoptic function” from the Latin roots plenus, full, and opticus,
to see. Adelson and Bergen propose to detect local changes along one or more
plenoptic dimensions and to represent the structure of the visual information in
a table of the detector responses, comparing them two by two. The two dimensions
of the table are simple visual detectors such as derivatives and the table
contents are possible visual elements. Adelson and Bergen use low order derivative
operators as 2-D receptive fields to analyze the plenoptic function. However,
the technique which they describe is restricted to derivatives of order one and
two, and does not include measurements involving derivatives along three or
more dimensions of the plenoptic function. It appears that the authors did not
follow up on their idea and that little or no experimental work was published on
this approach.
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Nevertheless the plenoptic function provides a powerful formalism for the 
measurement of specific local structures, including spatio-temporal patterns. 
This paper employs with such framework to describe activity patterns. Activity 
patterns are characterized by describing their local visual information using a set 
of spatio-temporal receptive fields, and by statistically modeling the descriptors 
responses. The result is a software sensor able to discriminate different patterns 
of activities. 



2.1 Using Receptive Fields 

The notion of receptive field in vision stems from studies describing the
visual cells of the cortex. Those studies attempt to understand the biological visual
system in order to match its performance in extracting local information measures.

Classically, the structure of receptive fields derives from signal decomposition
techniques. The two most widely used approaches for signal decomposition are the
Taylor expansion and the Fourier transform. The Taylor series expansion gives
a local signal description in the spatial dimension, while the Fourier transform
provides a description in the spectral domain. These two methods for signal
decomposition correspond respectively to the projection of the signal onto a basis
of functions with amplitude modulation and onto a basis of functions which
are frequency modulated. Other local decomposition bases are also possible. A
decomposition basis is generally chosen to suit the problem to be solved. For
example, a frequency-based analysis is more suitable for texture analysis, and a
fractal-based description for natural scene analysis. Independently of the choice
of basis, the description is done over an estimation support relative to the
locality of the analysis. The next section formulates the derivative operator of the
Taylor expansion and the spectral operator of the Fourier transform as generic
operators.



2.2 Generic Neighborhood Operators 

The concept of linear neighborhood operators was redefined by Koenderink and
van Doorn as generic neighborhood operators. Typically, operators are required
at different scales corresponding to different sizes of estimation support.
The authors motivated their method by rewriting neighborhood operators as
the product of an aperture function, $A(p, \sigma)$, and a scale equivariant function,
$\phi(p/\sigma)$:

$$G(p) = A(p, \sigma)\,\phi(p/\sigma) \tag{1}$$

The aperture function takes a local estimation at location $p$ of the plenoptic
function which is a weighted average over a support proportional to its scale
parameter, $\sigma$. A suitable aperture function is the Gaussian kernel, as it satisfies the
diffusion equation:

$$A(p, \sigma) = \frac{e^{-\frac{1}{2}\frac{p \cdot p}{\sigma^{2}}}}{\left(\sqrt{2\pi}\,\sigma\right)^{D}} \tag{2}$$

where $D$ is the number of dimensions of $p$.

The function $\phi(p/\sigma)$ is a specific point operator relative to the decomposition
basis. In the case of the Taylor expansion, $\phi(p/\sigma)$ is given by the Hermite polynomials:

$$\phi(p/\sigma) = (-1)^{n}\,He_{n}(p/\sigma) \tag{3}$$

In the case of the Fourier series, $\phi(p/\sigma)$ are the complex frequency modulation
functions tuned to selected frequencies, $\nu$:

$$\phi(p/\sigma) = e^{j\,\nu \cdot p/\sigma} \tag{4}$$

Within the context of spatial and spectral signal decomposition,
the generic neighborhood operators are, respectively, scale normalized Gaussian
derivatives [Lin98] and scale normalized Gabor filters.
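
The two families of operators above can be illustrated in one dimension. The following is a minimal sketch rather than the authors' implementation: the function names, the unit-area normalization of the aperture, and the sign of the complex exponent are assumptions made for the example. It relies on the identity that the n-th derivative of a Gaussian, multiplied by sigma^n, equals the Gaussian times (-1)^n He_n(x/sigma).

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval


def gaussian_aperture(x, sigma):
    """1-D Gaussian aperture A(x, sigma), normalized to unit area (assumption)."""
    return np.exp(-0.5 * (x / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)


def gaussian_derivative_operator(x, sigma, n):
    """Scale normalized n-th Gaussian derivative: A(x, sigma) * (-1)^n * He_n(x / sigma)."""
    coeffs = np.zeros(n + 1)
    coeffs[n] = 1.0  # selects the probabilists' Hermite polynomial He_n
    return gaussian_aperture(x, sigma) * (-1) ** n * hermeval(x / sigma, coeffs)


def gabor_operator(x, sigma, nu):
    """Gaussian aperture times a complex modulation at frequency nu (in units of 1/sigma)."""
    return gaussian_aperture(x, sigma) * np.exp(1j * nu * x / sigma)
```

The derivative operator can be checked against a finite-difference derivative of the aperture, and the modulus of the Gabor operator is the aperture itself.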

2.3 Motion Energy Receptive Fields 

The perception of activities involves the extraction of local visual motion
information. Techniques which explicitly reconstruct the optical flow are often complex
and specific to the analyzed scene; moreover, they are not well
suited for describing the motion of deformable moving objects. The extraction of
low level motion information involves the use of a decomposition basis sensitive
to motion, such as signal decomposition using combinations of Gaussian derivatives
or Gabor filters.

A measure of motion information rich enough to describe activities is easily
obtained in the spectral domain, since an energy measure depends on both
the velocity and the contrast of the input signal at a given spatio-temporal
frequency. Consider a space-time image, $I(p)$, and its Fourier transform, $\hat{I}(q)$,
with $p = (x, y, t)$ and $q = (u, v, w)$. Let $r_x$ and $r_y$ be respectively the speed
of horizontal and vertical motion. The Fourier transform of the moving image,
$I(x - r_x t, y - r_y t, t)$, is $\hat{I}(u, v, w + r_x u + r_y v)$. This means that spatial
frequencies are not changed, but all the temporal frequencies are shifted by minus the
product of the speed and the spatial frequencies. A set of Gabor based motion
energy receptive fields is used to sample the power spectrum of the moving texture.
Their structure relates to the spatio-temporal energy models of Adelson
and Bergen [AB85] and Heeger [Hee88]. Motion energy measures are computed
from the sum of the squares of even ($G_{even}$) and odd-symmetric ($G_{odd}$) oriented
spatio-temporal Gabor filters which have been tuned for the same orientation,
so that the measure is phase independent:

$$H(p) = \left(I(p) * G_{even}\right)^{2} + \left(I(p) * G_{odd}\right)^{2} \tag{5}$$

Adelson and Bergen [AB85] suggested that these energy outputs should be combined
in opponent fashion, subtracting the output of a mechanism tuned for
leftward motion from one tuned for rightward motion. The output of such filters
depends on both the velocity and the local spatial content of the input signal,
$I(p)$. The extraction of velocity information within a spatial frequency band
involves normalizing the energy of the filter outputs according to the response
of a static energy filter tuned to the same spatial orientation and null temporal
orientation:

$$w(p) = \frac{H_{Right}(p) - H_{Left}(p)}{H_{Static}(p)} \tag{6}$$

A triad of rightward, leftward and static Gabor energy filters is shown in part
(a) of figure 1. Such a spatio-temporal energy model allows the measurement of
low level visual motion information. A set of 12 motion energy receptive fields
is used, corresponding to 4 spatial orientations and 3 ranges of motion. This
set of motion energy receptive fields allows the description of the spatio-temporal
appearance of activity.



Fig. 1. Bandwidths of spatio-temporal receptive field triads. Figure (a) represents
responses for rightward (R), leftward (L) and static (S) units for a given spatial band
in the frequency domain (u, w), where u are the spatial frequencies and w the temporal
ones. Figure (b) is a map of the spatial bandwidths of a set of 12 motion energy receptive
fields in the spatial frequency domain (u, v). There are 4 different orientations and
3 different scales. Figure (c) is a 3D view of a set of 4 motion energy receptive fields
corresponding to 4 orientations and 1 scale.

Note that the optical flow is not reconstructed explicitly; instead a vector of
measures, $w(p)$, is obtained, where the elements $w_i(p)$ of $w(p)$ are motion
energy measures tuned for different sub-bands. The combination of the 12 motion
energy receptive fields can lead to a motion estimate. Heeger [Hee88] uses
a numerical optimization procedure to find the plane that best accounts for
the measurements (the error criterion is least-squares regression on the filter
energies). Spinei et al. [SPH98] make the response of a triad of Gabor energy
filters, $w(p)$, proportional to motion using a non-linear combination of the
responses of the Gabor filters. They then merge the estimated motion components
corresponding to different orientations and scales. We stress, however, that optical
flow estimation is not the purpose of the proposed approach, since we are motivated
by signal decomposition. Low level motion information is extracted using a
set of motion energy receptive fields based on Gabor energy filters. The outputs
from the set of receptive fields provide a vector of measurements, $w(p)$, giving
a synopsis of the local visual motion information.
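
The opponent, normalized energy measure described in this section can be sketched for a single spatial dimension plus time. This is an illustrative sketch under stated assumptions, not the authors' code: the function names, the tuning values u0 = w0 = 0.25, the 15 x 15 support, and the small `eps` guard in the denominator are all choices made for the example. Only the standard deviation 1.49 is taken from the text. Each energy is evaluated at the centre of the patch, which corresponds to one output sample of the filtering operation.

```python
import numpy as np


def gabor_pair(u0, w0, sigma_x, sigma_t, half=7):
    """Even and odd space-time Gabor filters tuned to spatial frequency u0, temporal frequency w0."""
    x = np.arange(-half, half + 1)
    t = np.arange(-half, half + 1)
    X, T = np.meshgrid(x, t, indexing="ij")
    envelope = np.exp(-0.5 * ((X / sigma_x) ** 2 + (T / sigma_t) ** 2))
    phase = 2.0 * np.pi * (u0 * X + w0 * T)
    return envelope * np.cos(phase), envelope * np.sin(phase)


def energy(patch, pair):
    """Phase independent energy: squared even response plus squared odd response."""
    even, odd = pair
    return np.sum(patch * even) ** 2 + np.sum(patch * odd) ** 2


def opponent_measure(patch, u0, w0, sigma_x=1.49, sigma_t=1.49, eps=1e-12):
    """(H_right - H_left) / H_static; eps is an added guard against division by zero."""
    # A pattern moving rightward at speed r has temporal frequency -r * u0.
    h_right = energy(patch, gabor_pair(u0, -w0, sigma_x, sigma_t))
    h_left = energy(patch, gabor_pair(u0, +w0, sigma_x, sigma_t))
    h_static = energy(patch, gabor_pair(u0, 0.0, sigma_x, sigma_t))
    return (h_right - h_left) / (h_static + eps)


def moving_grating(u0, speed, half=7):
    """Test pattern: a cosine grating translating at `speed` pixels per frame."""
    x = np.arange(-half, half + 1)
    t = np.arange(-half, half + 1)
    X, T = np.meshgrid(x, t, indexing="ij")
    return np.cos(2.0 * np.pi * u0 * (X - speed * T))
```

With these choices, a grating moving rightward gives a positive measure, a grating moving leftward a negative one, and a static grating a measure close to zero.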



3 Probabilistic Analysis of Feature Space

The outputs from the set of motion energy receptive fields provide a vector of 
measurements, w (p), at each pixel. The joint statistics of these vectors allow the 
probabilistic perception of activity. A multi-dimensional histogram is computed 
from the outputs of the filter bank for each class of activity. These histograms can 
be seen as a form of activity signature and provide an estimate of the probability 
density function for use with Bayes rule. 



3.1 Measurements Probability Density 

For each class of activity $a_k$, a multi-dimensional histogram of measurement
vectors is computed. The histogram is an estimate of the probability density
$p(w | a_k)$ of action $a_k$. The subspace of receptive fields has a large number
of dimensions: 12 in the case of the basis of motion energy receptive
fields defined previously. The main problem is the computation of a histogram
over such a large space.

An extension of the quad-tree technique is used to represent the histograms.
Let $N$ be the number of dimensions (i.e. the number of motion energy receptive
fields). A dichotomic tree is designed where each node has up to $2^N$ potential
branches corresponding to filled cells. Cells are sub-divided in two along each
dimension. Among the $2^N$ resulting new cells, the filled cells are themselves
sub-divided until the final resolution is reached.

This algorithm allows the computation and storage of high dimensional
histograms which are quite sparse.
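
The tree described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `SparseHistogram`, the helper `quantize`, and the use of nested Python dictionaries for the 2^N-way nodes are assumptions made for the example. At each level one bit per dimension selects one of the 2^N children, and only the children that actually receive data are created.

```python
import numpy as np


class SparseHistogram:
    """2^N-ary tree over quantized measurement vectors; only filled cells are stored."""

    def __init__(self, n_dims, n_bits):
        self.n_dims = n_dims
        self.n_bits = n_bits
        self.root = {}
        self.total = 0

    def _path(self, cell):
        # One tree level per bit, most significant bit first. The N bits taken
        # at a level (one per dimension) identify one of the 2^N children.
        for level in range(self.n_bits - 1, -1, -1):
            yield tuple((int(c) >> level) & 1 for c in cell)

    def add(self, cell):
        node = self.root
        for key in self._path(cell):
            node = node.setdefault(key, {})  # create the child only when it is filled
        node["count"] = node.get("count", 0) + 1
        self.total += 1

    def count(self, cell):
        node = self.root
        for key in self._path(cell):
            if key not in node:
                return 0  # empty cell: the branch was never created
            node = node[key]
        return node.get("count", 0)

    def density(self, cell):
        """Relative frequency of a cell, used as an estimate of p(w | a_k)."""
        return self.count(cell) / self.total if self.total else 0.0


def quantize(w, lo, hi, n_bits):
    """Map each component of w from the range [lo, hi] to an integer in [0, 2^n_bits - 1]."""
    levels = 2 ** n_bits
    idx = np.floor((np.asarray(w, dtype=float) - lo) / (hi - lo) * levels).astype(int)
    return np.clip(idx, 0, levels - 1)
```

One such histogram would be built per class of activity, by adding the quantized response vector of every training pixel of that class.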



3.2 Probabilistic Perception of Activities 



The probabilistic perception of action $a_k$ is achieved by considering the vector
of local measures, $w(p)$, whose elements are motion energy measures, $w_i(p)$,
tuned for different sub-bands. The probability, $p(a_k | w)$, that the pixel $p$ belongs
to action $a_k$ according to $w(p)$ is computed using Bayes rule:

$$p(a_k | w) = \frac{p(w | a_k)\,p(a_k)}{p(w)} = \frac{p(w | a_k)\,p(a_k)}{\sum_{l} p(w | a_l)\,p(a_l)} \tag{7}$$

where $p(a_k)$ is the a priori probability of action $a_k$, $p(w)$ is the a priori
probability of the vector of local measures $w$, and $p(w | a_k)$ is the probability density
of action $a_k$. The probability $p(a_k)$ of action $a_k$ is estimated according to the
context. But without a priori knowledge, it is fixed to the maximum.

The probability, $p(a_k | w)$, allows only a local decision at location $p = (x, y, t)$.
The final result at a given time $t$ is the map of the conditional probabilities
that each pixel belongs to an activity of the training set based on its space-time
neighborhood appearance.
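
Equation (7) applied to every pixel can be sketched as below. The function name and the array layout (one likelihood map per class) are assumptions for the example, and equal priors are used when none are given, which is one possible reading of the case without a priori knowledge. Pixels whose measurement falls in an empty histogram cell for every class have no evidence, and their posteriors are left at zero.

```python
import numpy as np


def posterior_maps(likelihoods, priors=None):
    """Per-pixel Bayes rule.

    likelihoods: array of shape (K, H, W) holding p(w | a_k) at each pixel.
    priors: optional array of K prior probabilities p(a_k).
    Returns an array of shape (K, H, W) holding p(a_k | w).
    """
    likelihoods = np.asarray(likelihoods, dtype=float)
    k = likelihoods.shape[0]
    if priors is None:
        priors = np.full(k, 1.0 / k)  # assumption: equal priors
    joint = likelihoods * np.asarray(priors, dtype=float)[:, None, None]
    evidence = joint.sum(axis=0, keepdims=True)  # p(w), summed over classes
    # Leave zeros where no class explains the measurement.
    return np.divide(joint, evidence, out=np.zeros_like(joint), where=evidence > 0)
```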



4 Application to the Perception of Human Activities

The vast amount of raw data generated by digital video units and their poor
capacity to filter out useless information led us to develop a framework for
highlighting specific relevant events according to scene activities. Some examples
of applications are assisted video-surveillance, helping users concentrate their
attention, or intelligent office environments which understand and react to the
configuration of the scene. In this context the probabilistic framework was trained
for the perception of human activities in an office fitted out with a camera
for visual surveillance.

The wide angle camera allows the surveillance of the whole office. The analyzed
activities are “coming in”, “going out”, “sit down”, “wake up”, “dead” (when
somebody falls down), “first left”, “first right”, “second left”, “second right”,
“turn left” and “turn right”. Those actions can take place anywhere in the scene
and under any illumination conditions. A view of the scene and examples of
the considered activities are shown in figure 2.




Fig. 2. A view of the large visual angle camera. Examples of the analyzed activities are 
shown. Images are 192 x 144 pixels a per pixels and the acquisition rate is 10 Hz. 



4.1 Assumptions and Parameters 

This section describes the conditions under which the ability of the probabilistic
sensor to perceive the classes of activities defined previously is evaluated.

Global conditions: It is assumed that the camera is fixed, therefore there is no 
global motion to compensate. The changes in the scene illumination are uncon- 
trolled and the static objects can move location. Images are 192 x 144 pixels a 
per pixels and the acquisition rate is 10 Hz. 

Receptive fields parameters: All of the results presented in this paper were 

produced with a spatial frequency tuning for each Gabor filter as + Vq = | 
cycles per pixel and a standard spatial deviation of $\sigma_x = \sigma_y = 1.49$ corresponding
to a bandwidth of 0.25. The 4 spatial orientations are $0$, $\pi/4$, $\pi/2$ and $3\pi/4$. Additional
scales are obtained using families of filters which are spaced one octave apart in
spatial frequency, with a standard spatial deviation which is twice as large.
The sub-band-filters are tuned for the same temporal frequency wq = j cycles 
per frame and the same temporal scale $\sigma_t = 1.49$.

Histograms computation: The histograms are computed by quantizing the receptive
field responses with 1 to 8 bits. Each class of activity is performed between
5 and 18 times, by 5 people acting anywhere in the scene, depending on how
often the activity occurs in a scenario. Each sequence of an activity is
between 11 and 44 frames long. The left hand side of table 1 summarizes information
on the histogram computation for each class of activity.



Table 1. Information on the sequences of each class of activity. The left hand part of
the table deals with sequences used for histogram computation and the right hand part of
the table deals with test sequences.

              histogram computation              test sequences
class of      number of  frames per  total       number of  frames per  total
activity      sequences  sequence    frames      sequences  sequence    frames
in            5          26-35       151         18         21-26       423
out           5          30-36       166         18         18-26       394
sit           18         14-27       378         24         13-28       474
wake          18         13-25       320         36         11-33       759
dead          5          20-23       105         12         10-14       143
left1         4          31-41       146         15         10-35       370
right1        4          26-41       140         12         21-32       320
left2         4          29-44       143         28         6-32        513
right2        4          25-40       132         36         6-34        785
turn right    6          11-18       93          15         7-28        182
turn left     6          12-19       82          25         6-33        299



Perception: The perception of activities according to Bayes rule (equation 7)
is weighted by the a priori probability $p(a_k)$ of action $a_k$. Without a priori
knowledge the probability $p(a_k)$ is fixed to the maximum.

Test sequences: A set of test sequences is used to evaluate the sensitivity of
the probabilistic sensor. These test sequences are different from the ones used to
compute the multi-dimensional histograms. Information on the test sequences
of each activity is summarized in the right hand side of table 1.



4.2 Results 

The method presented in this paper is a sensor able to perceive elements of trained
classes of activities. Since the receptive fields integrate temporal information
over 9 frames and each of the sequences of activities is typically 20 frames
long, the sensor outputs a sequence of elements, rather than a single response
element for each trained activity. It is therefore difficult to assess its sensitivity and its
robustness to variations. Nevertheless, an example of a probabilistic perception of
the activity “second left” is shown in figure 3. The framework output is a map
of the local probabilities $p(a_k | w)$ that each pixel belongs to one of the trained
classes of activities.



Fig. 3. Examples of resulting maps of the local probabilities $p(a_k | w)$. The original
image is in the upper left corner. Maps of the probability that each pixel belongs to one
of the trained classes of activity are shown. White pixels correspond to high probabilities
and dark pixels to low probabilities. The occurring activity “second left”, which has been
recognized, is highlighted in red.

To evaluate the ability of the sensor to perceive different classes of activity, a global
decision rule is first designed which takes the map of the local probabilities as input.
Then the recognition rate is evaluated as a function of the number of bits used to
estimate the probability density of each class of activity, and as a function of the
number of receptive fields.

Decision rule: A global decision is taken by first selecting, at each pixel, the largest
$p(a_k | w)$ among the $K$ classes. The class of activity which has the largest number
of such local maxima is selected for recognition. An example of activity recognition
using such a rule is shown in figure 3, where the class of activity “second left” is
highlighted in red.
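
Read as a per-pixel vote, the decision rule above can be sketched as follows. The function name and the treatment of pixels without any evidence (they abstain) are assumptions for the example, not details given in the text.

```python
import numpy as np


def global_decision(posteriors):
    """Select the class that wins the largest number of pixels.

    posteriors: array of shape (K, H, W) holding p(a_k | w) at each pixel.
    Returns (index of the selected class, number of votes per class).
    """
    posteriors = np.asarray(posteriors, dtype=float)
    winners = posteriors.argmax(axis=0)        # most probable class at each pixel
    informative = posteriors.max(axis=0) > 0   # pixels with no evidence abstain
    votes = np.bincount(winners[informative], minlength=posteriors.shape[0])
    return int(votes.argmax()), votes
```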

Histogram quantization: The subspace of receptive field responses has
a large number of dimensions and the histograms are quite sparse. To bring out
the sparseness of the histograms, the recognition rates are studied as a function
of the quantization rate of the histograms and as a function of the number of
dimensions in the subspace of receptive fields. The graphs of figure 4 show the
evolution of recognition rates as a function of the number of bits per dimension
used to represent the histograms. The left hand side of figure 4 summarizes
results in a subspace using only one range of Gabor filters (corresponding to one
standard spatial deviation $\sigma_s = 1.49$). In this case the number of dimensions is
4, corresponding to the 4 orientations of the receptive fields. Above a histogram
quantization rate of 5 bits, the histogram cells are empty and Bayes rule is
unusable. Below a quantization rate of 4 bits, the histograms overlap and
activities are confused. The right hand side of figure 4 gives results with the
three scale ranges of filters, corresponding to a subspace of receptive fields of 12
dimensions. The graphs show that the histograms are too sparse and that a quantization
rate of 2 or 3 bits is the limit. It appears clearly that the training set of sequences
of activities is not large enough for such a large appearance space.





Fig. 4. Recognition rates per class of activity as a function of the quantization rate
of histograms (in bits). The left hand side figure deals with results using a 4D subspace
and the right hand side figure with results for a 12D subspace.



Recognition rates: Recognition rates are compared using histograms computed
over one range of receptive fields (4D subspace) and coded with 4 bits per
dimension, and histograms computed over a 12D subspace and coded with 2
bits per dimension. Table 2 summarizes recognition rates for each class of activity.
Note that the activities “turn left” and “turn right” are not recognized. The
activities “in”, “out” and “dead” are not well perceived.



Table 2. Recognition rates for each class of activity of the test sequences. The first
row deals with results using a 4D subspace with histograms computed with 4 bits per
dimension. The second row shows results for a 12D subspace with histograms computed
with 2 bits per dimension. The activities “turn left” and “turn right” are not recognized.

%             in      out     sit     wake    dead    left1   right1  left2   right2
4D - 4 bits   11.1    5.4     57.6    56.5    5.7     64.5    92.5    58.5    82.1
12D - 2 bits  35.0    12.2    64.5    68.5    10.0    65.9    90.9    65.5    78.7

There are two reasons why those activities cannot be discriminated. The
first reason is that the acquisition rate is only 10 Hz, which is not enough to
capture the motion information of short activities. The second reason comes
from the decision rule, which is not rich enough to take into account the temporal
complexity of activities.



Table 3. Confusion matrix of the classes of activities for the test sequences. The first left
column lists the input activities and the first upper row lists the outputs. Each cell is the
number of output labels for the corresponding input. The last right column is the total
number of inputs per class of activity.

            in      out     sit     wake    dead    left1   right1  left2   right2  turn    turn    total
                                                                                    right   left
in          148     22      8       4       0       43      39      85      8       0       0       423
out         28      48      3       25      0       71      107     8       27      0       2       394
sit         9       2       306     29      2       30      46      32      13              1       474
wake        23      11      22      520     2       66      62      11      41      0       1       759
dead        3       U       88      4       14      0       8       24      0       0       0       143
left1       1               1       1       0       244     9       53      60      0       0       370
right1      1       U       5       7       0       0       291     16      0       0       0       320
left2       16              120     0       1       8       31      336     0       0       0       513
right2      U       11      1       103     0       40      9       0       618     0       0       785
turn right  4               25      21      0       44      25      40      21      0       1       182
turn left   22      19      26      36      1       66      42      27      51      0       0       299



Table 3 gives the confusion matrix of activities. It
appears that the activity “in” is composed of “in” and “ri^/iif ”, the activity “out” 
is composed of “out”, “leftl” and “rightl”, and the activity “dead” is composed 
of “sit” and “left2”, and so on. Figure 5 shows examples of sequences of the
probabilistic activity sensor outputs for the input activities “dead”, “coming
in”, “wake up” and “sit down”. For example, the activity “sit down” is perceived
with two main components, which are the activity element “sit down”
followed by the activity element “first right”. This temporal decomposition is natural,
since the end of the action “sit down” is a pure horizontal translation corresponding
to the phase when the person leans back against the back of the chair.



Fig. 5. Examples of sequences of the probabilistic activity sensor outputs over time (t).
The inputs are respectively sequences of the activities “dead”, “coming in”, “wake up”
and “sit down”.




4.3 Conclusion 

The probabilistic sensor allows the discrimination of several classes of complex
elements of activities. Results are encouraging and make several points clear:

- A large subspace of receptive fields is necessary to perceive and discriminate
complex activities. A single range of receptive fields, corresponding to a 4D
subspace, is not enough to capture signal disparity. Three ranges of receptive
fields (12D subspace) have given interesting recognition rates.

- A difficulty is to build a training set of classes of activities large enough
to allow the computation of multi-dimensional histograms. It has been shown that
histograms computed over such a large subspace (12D) are quite sparse. But
sparseness can be limited by enlarging the training set, and results can be
improved.

- Improvements can be obtained by increasing the acquisition rate, in order to
capture finer temporal motion information, such as the opening of the door in the case
of the activity “coming in”.

The main difficulty is still the definition of a recognition framework allowing
the evaluation of the robustness of the activity sensor, and of its
sensitivity to the histogram computation and to the selectivity of the receptive fields.
The decision rule used previously is not rich enough to take into account the fact that
activity elements are complex. The sensor output is a sequence of the activity
elements detected over a temporal window which is short relative to the duration
of the input activity. A good cue is to use temporal sequences of decisions
(see figure 5) as input to a more global decision scheme. The next section defines
a complex global decision scheme based on Hidden Markov Models.

5 A Hidden Markov Model Based Recognition Scheme 

The output of the probabilistic sensor is the temporal decomposition of complex
activities into the most probable classes of short activity elements. Since this
temporal decomposition is difficult to predict for a given class of activity, the
use of Hidden Markov Models seems appropriate for recognition. Hidden Markov
Models (HMMs) are doubly stochastic models, because they involve an underlying
stochastic process that is not observable. HMMs are appropriate for modeling
and recognizing time-warping dynamic patterns. HMMs have been popularized
in the application area of speech recognition. Recently, HMMs have also been
employed for gesture recognition and activity recognition.

5.1 Discrete Hidden Markov Model 

A discrete Hidden Markov Model can be viewed as a nondeterministic finite
automaton. Each state, $s_i$, is characterized by a transition probability, $a_{ij}$ (the
probability of reaching state $s_j$ from state $s_i$), an initial state probability,
$\pi_i$, and a discrete output probability distribution, $b_i(O_k)$, which defines the
conditional probability of emitting observation symbol $O_k$ from state $s_i$. An HMM is
denoted by $\lambda = (A, B, \pi)$ where $A$ is the transition matrix, $A = \{a_{ij}\}_{ij}$, $B$ is the
observation probability vector, $B = \{b_i(O_k)\}_{ik}$, and $\pi$ is the initial probability
vector, $\pi = \{\pi_i\}_i$.

The transition matrix, $A$, defines the topology of the automaton. In the
general case, all values of $a_{ij}$ are defined and the HMM is called ergodic. If $A$
is band diagonal, the HMM is left-right. A left-right HMM is appropriate when a
temporal order appears.

5.2 HMM Based Activities Recognition Scheme

This section details the steps of our approach to discrete HMM-based
activity recognition, which uses one HMM per class of activity
for classification.

1. Describing an HMM for each activity:

An HMM is employed to model each activity, $a$, which is characterized by
$\lambda_a = (A_a, B_a, \pi_a)$. Even though the values of the elements of $A_a$, $B_a$ and $\pi_a$ will be
estimated in the training process, the structure of the matrix $A_a$ has to be
determined beforehand. The structure covers both the topology of the
model (ergodic or left-right) and the number of states. The number of states
can be determined using different methods:

exhaustive (a priori): testing every possible number of states between one
and an arbitrarily selected number, and selecting the number which
maximizes the probability of recognition.

heuristic (a posteriori): the number is selected by studying the problem
and the observation sequences.

automatic: the maximization of the Bayesian Information Criterion (BIC)
m. fh8| provides an automatic method to find the number of states. 
Given a training sample, Sa, for activity a, the criterion is defined as: 

$$BIC(\lambda_a, N_a) = \log P(S_a | \lambda_a, N_a, \hat{\phi}) - \frac{\nu_{\lambda_a, N_a}}{2} \log(card(S_a)) \tag{8}$$

In the above equation, $\nu_{\lambda_a, N_a}$ is the number of independent parameters
in the HMM $\lambda_a$ composed of $N_a$ states, and $\hat{\phi}$ is the maximum
likelihood estimator. The equation can be viewed as the difference between a term
measuring how well the model fits the data, and a penalty term
which penalizes models with a large number of independent parameters.

2. Training the HMMs:

For each class of activity (i.e. each HMM), the model parameters $\lambda_a = (A_a, B_a,
\pi_a)$ are adjusted in order to maximize the likelihood $P(S_a | \lambda_a)$, the probability
of observing a training sample, $S_a$, given the model parameters, $\lambda_a$.
The Baum-Welch re-estimation formulas are used to re-estimate the model
parameters until a local maximum is reached.

3. Classifying new activity:

Given an observation sequence, $O$, of an unknown activity, the classification
process estimates the class, $a^{*}$, such that:

$$a^{*} = \arg\max_{1 \le a \le \mathcal{N}} P(\lambda_a | O) \tag{9}$$

In many cases, only $P(O | \lambda_a)$ is known. Bayes rule allows the computation of
$P(\lambda_a | O) = k\,P(O | \lambda_a)$, where $k$ is a constant depending on the probability of
each activity. The activities are considered equally probable, so $k$ is the same
for every class. Baum's forward-backward procedure is used to compute the
probability $P(O | \lambda_a)$ efficiently.
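
The classification step can be sketched with the forward pass of the forward-backward procedure, which is sufficient to obtain P(O | lambda_a). This is a minimal sketch under stated assumptions: the function names and the dictionary of models are hypothetical, equal class priors are assumed as in the text, and no scaling is applied, which is acceptable only for short observation sequences.

```python
import numpy as np


def forward_likelihood(obs, A, B, pi):
    """P(O | lambda) for a discrete HMM, computed with the forward recursion.

    obs: sequence of observation symbol indices.
    A:   (S, S) transition matrix, A[i, j] = probability of moving from state i to j.
    B:   (S, V) emission matrix, B[i, k] = probability of emitting symbol k in state i.
    pi:  (S,) initial state probabilities.
    """
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return float(alpha.sum())


def classify(obs, models):
    """Return the name of the model with the highest P(O | lambda_a), and all scores.

    models: dict mapping an activity name to its (A, B, pi) triple.
    With equal class priors this is also the maximum of P(lambda_a | O).
    """
    scores = {name: forward_likelihood(obs, *params) for name, params in models.items()}
    return max(scores, key=scores.get), scores
```

A two-state left-right model has an upper triangular transition matrix, for example `[[0.7, 0.3], [0.0, 1.0]]`, with all the initial probability on the first state.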

HMMs are composed of several parameters: observation sequences, HMM
topology and number of states. The next paragraphs deal with those parameters and
with the experimental results used to select their values.

Sequences of observable symbols: The vocabulary used as input to the HMMs consists of
the different classes of elements of activities from the probabilistic sensor (see
section 4). Only the activities (composed of several elements of activities)
“sit down”, “wake up”, “first left”, “first right”, “second left” and “second right”
are studied. The activities “coming in”, “going out”, “dead”
(when somebody falls down), “turn left” and “turn right” are not considered because
of the large number of confusions (see table 3).

HMMs topology: The nature of the activities and the outputs provided by the
probabilistic sensor (see section 4 and table 3) result in a succession of different
elements of activity within a complete activity. This tendency fits the left-right
topology.

Number of states: The number of states can be estimated using the methods
presented in section 5.2. The number of states fixed a priori (heuristic method) and the
number estimated by the Bayesian Information Criterion converge to the same value,
which is 2 states per activity.

Training sets: The HMMs are trained with 130 sequences divided into $\mathcal{N} = 11$
classes as shown in table 4. The training set is too small to estimate the
HMMs reliably. It nevertheless gives preliminary results and an estimate of the feasibility
of such recognition. If we consider left-right HMMs with two states, the number
of parameters to estimate is 26: 2 for the transition matrix, 2 for the initial
probability vector, and 2 x 11 for the observation probability vectors (one for each
state). Allowing between 10 and 20 examples per parameter, future
experiments will require a training set of between 260 and 520 examples
per activity.

Table 4. Number of training sequences for each class of activity.

Classes of activity   sit     wake    left1   right1  left2   right2  Total
Number of sequences   23      34      12      12      22      27      130



5.3 Recognition of Activities 

This section presents preliminary results on the recognition of complete
activities. In this experiment, we have used cross validation on the training set
presented in section 5.2. From the 130 sequences, one is extracted for recognition,
and the remaining 129 sequences are used to train the 6 HMMs, one per activity
to be recognized. Table 5 shows recognition rates.

Table 5. Recognition rates of activities. A cross validation is used.

Classes of activity                 sit     wake    left1   right1  left2   right2  Total
Recognition rates (%)               91%     88%     83%     92%     82%     85%     87%
Number of misclassified activities  2       4       2       1       4       5       18



Table 5 shows promising results. We obtain a global recognition rate of 87%,
corresponding to 18 misclassified activities out of 130. These misclassifications
are due to the small number of training examples, which either makes it
impossible to compute the probability of some observation sequences or leads to
misclassification.
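
The cross validation described in this section, where each sequence is held out in turn and the models are retrained on the others, can be sketched generically. The function below is an illustration only: `train` and `classify` are hypothetical callables standing in for HMM training and HMM-based classification, and are not defined by the paper.

```python
def leave_one_out(sequences, labels, train, classify):
    """Leave-one-out cross validation.

    sequences: list of observation sequences.
    labels:    list of the corresponding activity labels.
    train:     callable (sequences, labels) -> model (hypothetical).
    classify:  callable (model, sequence) -> predicted label (hypothetical).
    Returns (recognition rate, number of misclassified sequences).
    """
    errors = 0
    for i in range(len(sequences)):
        # Train on every sequence except the i-th one.
        train_x = sequences[:i] + sequences[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if classify(model, sequences[i]) != labels[i]:
            errors += 1
    return 1.0 - errors / len(sequences), errors
```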

6 Conclusion and Perspectives 

A new approach for activity recognition has been presented. Recognition of activity
elements is processed statistically according to the conditional probability
that a measure of the local spatio-temporal appearance occurs for a given
action. A temporal regularisation of the perceived activity elements is then done to
recognize complex activities.

This paper describes work in progress, and the experimental results are limited
but encouraging. Further experiments will attempt to quantify the limits of the
technique. Several technical details must also be resolved to improve the
results. On the one hand, the vector of receptive field responses is sensitive
simultaneously to three motion ranges. The space and time scales have been
selected to ensure a large bandwidth. Since multi-scale strategies are
redundant, one solution will be to select local scale parameters automatically
according to the maxima over scales of normalized derivatives [Lin98]. On the
other hand, the framework presented in this paper is a sensor able to perceive
activities previously learned. Enlarging the training set of each class of
activities will certainly improve the results, since the instabilities come from
the sparseness of the histograms.

The output of the probabilistic sensor is the temporal decomposition of complex
activities (about 20 frames) into the most probable classes of short activity
elements (9 frames). Since the temporal aperture window of the description is
relatively small compared to the temporal duration of an activity, Hidden Markov
Models (HMMs) are employed to regularize the recognition. In a sense, the HMM
provides context. The temporal sequences of decisions are used as input to the
HMMs for the recognition of complex activities. It has been shown that some
misclassifications are due to the lack of training examples. Further experiments
using a larger training set will be done soon.

Nevertheless, plugging the activity perception framework into an intelligent
office environment controlled by a supervisor is being actively considered. If
the intelligent environment knows where people are in the scene, the a priori
probability of each class of activities could be estimated according to the
context (context cells). Introducing this a priori knowledge into the Bayes rule
will improve the sensitivity of the activity recognition. For example, if the
tracked person comes in front of a computer, the probability that the action
"sit down" occurs is higher than that of "going out".

Note that the probabilistic framework for the perception of activities runs at
10 Hz on a standard dual Pentium III 600 MHz PC.

Abstract. We introduce a 3 x 3 x 3 tensor $H^{ijk}$ and its dual $H_{ijk}$
which represent the 2D projective mapping of points across three projections
(views). The tensor $H^{ijk}$ is a generalization of the well known 2D
collineation matrix (homography matrix) and it concatenates two homography
matrices to represent the joint mapping across three views. The dual tensor
$H_{ijk}$ concatenates two dual homography matrices (mappings of line space) and
is responsible for representing the mapping associated with moving points along
straight-line paths, i.e., $H_{ijk}$ can be recovered from line-of-sight
measurements only.



1 Introduction 

In this paper we revisit the fundamental element of projective geometry, the
collineation (also referred to as the homography matrix) between two sets of
points on the projective plane undergoing a projective mapping. The role of
homography matrices, which map point sets between two views of a planar object,
is basic in multiple-view geometry in computer vision. The object stands on its
own as a point-transfer vehicle for planar scenes (aerial photographs, for
example) and in applications of mosaicing, camera stabilization and tracking
[6]; a homography matrix is also a standard building block in handling 3D scenes
from multiple 2D projections: the "plane+parallax" framework [7, 4, 5, 2] uses
a homography matrix for setting up a parallax residual field relative to a
planar reference surface, and the trifocal tensor of three views is represented
by a "homography-epipole" structure whose slices are homography matrices as well
[3, 8].

In our work we first consider a 3-view version of a projective mapping
represented by a 3 x 3 x 3 contravariant tensor $H^{ijk}$, referred to as a
homography tensor (abbreviated as "Htensor"). The entries of the Htensor are
bilinear products of the original pair of homography matrices and its 27
coefficients can be recovered linearly (up to scale) from 4 matching points
(lines) across the three views. The Htensor can perform the image-to-image
mapping directly, or alternatively the original pairwise collineations can be
linearly recovered from the slices of the tensor. The 27 entries of the tensor
satisfy a number of non-linear constraints which are unique to the coupling of
three views together; in pairwise projective mappings all constraints are linear.

We next consider the dual Htensor, a covariant form $H_{ijk}$ whose constituents
are dual homography matrices (mappings of line space). The dual Htensor has an
interesting "twist" in the sense that it applies to projections of moving points
following straight-line paths on a planar surface. In other words, if
$p^i p'^j p''^k H_{ijk} = 0$ then the three optical rays defined by the image
points $p, p', p''$ in views 1,2,3 respectively meet at a line. Consequently,
the dual tensor opens new applications in which static and moving points live
together as equal partners: the process of mapping and estimation of the dual
tensor from image measurements need not know in advance what is moving and what
is static.



1.1 Background and Notations 

We will be working with the projective plane, i.e., the space $\mathcal{P}^2$.
Points and lines are represented by triplets of numbers (not all zero) that
define a coordinate vector. Consider a collection of planar points
$P_1, ..., P_n$ in space living on a plane $\pi$ viewed from two views. The
projections of $P_i$ are $p_i, p'_i$ in views 1, 2 respectively. There exists a
unique collineation (homography) 3 x 3 matrix $A_\pi$ that satisfies the
relation $A_\pi p_i \cong p'_i$, $i = 1, ..., n$, where $A_\pi$ is uniquely
determined by 4 matching pairs from the set of n matching pairs. Moreover,
$A_\pi^{-\top} s \cong s'$ will map between matching lines $s, s'$ arising from
3D lines living in the plane $\pi$. Likewise, $A_\pi^{\top} s' \cong s$ will map
between matching lines from view 2 back to view 1.
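
The point and line mappings above can be checked numerically. The sketch below is illustrative only: the matrix `A` is an arbitrary invertible matrix chosen for the example (a hypothetical collineation, not data from the paper), and all variable names are assumptions.

```python
import numpy as np

# Hypothetical collineation from view 1 to view 2 (any invertible 3x3 matrix will do).
A = np.array([[1.0, 0.2, 0.3],
              [-0.1, 0.9, -0.2],
              [0.05, 0.02, 1.0]])

# Two points in view 1 and the line joining them (cross product of homogeneous points).
p_a = np.array([0.3, -1.2, 1.0])
p_b = np.array([2.0, 0.5, 1.0])
s = np.cross(p_a, p_b)

# Points are mapped by A, lines by the inverse transpose of A.
q_a, q_b = A @ p_a, A @ p_b
s_prime = np.linalg.inv(A).T @ s

# The mapped line must pass through both mapped points (incidence is preserved).
incidence = np.array([s_prime @ q_a, s_prime @ q_b])

# Mapping the line back from view 2 to view 1 uses the transpose of A.
s_back = A.T @ s_prime
```

Both entries of `incidence` vanish up to rounding, and `s_back` is parallel to `s`, which is all that matters for homogeneous coordinates.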

It will be most convenient to use tensor notations from now on because
the material we will be using in this paper involves coupling together pairs of
collineations into a "joint" object. The distinction of when coordinate vectors
stand for points or lines matters when using tensor notations. A point is an
object whose coordinates are specified with superscripts, i.e.,
$p^i = (p^1, p^2, p^3)$. These are called contravariant vectors. A line in
$\mathcal{P}^2$ is called a covariant vector and is represented by subscripts,
i.e., $s_j = (s_1, s_2, s_3)$. Indices repeated in covariant and contravariant
forms are summed over, i.e., $p^i s_i = p^1 s_1 + p^2 s_2 + p^3 s_3$. This is
known as a contraction. For example, if $p$ is a point incident to a line $s$ in
$\mathcal{P}^2$, then $p^i s_i = 0$.

Vectors are also called 1-valence tensors. 2-valence tensors (matrices) have
two indices and the transformation they represent depends on the
covariant-contravariant positioning of the indices. For example, $a_i^j$ is a
mapping from points to points (a collineation, for example), and hyperplanes
(lines in $\mathcal{P}^2$) to hyperplanes, because $a_i^j p^i = q^j$ and
$a_i^j s_j = r_i$ (in matrix form: $Ap = q$ and $A^\top s = r$); $a_{ij}$ maps
points to hyperplanes; and $a^{ij}$ maps hyperplanes to points. When viewed as a
matrix the row and column positions are determined accordingly: in $a_i^j$ and
$a_{ji}$ the index $i$ runs over the columns and $j$ runs over the rows, thus
$b_j^k a_i^j = c_i^k$ is $BA = C$ in matrix form. An outer-product of two
1-valence tensors (vectors), $a_i b^j$, is a 2-valence tensor $c_i^j$ whose
$i,j$ entries are $a_i b^j$; note that in matrix form $C = ba^\top$. A 3-valence
tensor has three indices, say $H_i^{jk}$. The positioning of the indices reveals
the geometric nature of the mapping: for example, $p^i s_j H_i^{jk}$ must be a
point because the $i,j$ indices drop out in the contraction process and we are
left with a contravariant vector (the index $k$ is a superscript). Thus,
$H_i^{jk}$ maps a point in the first coordinate frame and a line in the second
coordinate frame into a point in the third coordinate frame. A single
contraction, say $p^i H_i^{jk}$, of a 3-valence tensor leaves us with a matrix.
Note that when $p$ is (1, 0, 0), (0, 1, 0) or (0, 0, 1) the result is a "slice"
of the tensor.

We will make extensive use of the "cross-product tensor" $\epsilon$ defined
next. The cross product (vector product) operation $c = a \times b$ is defined
for vectors in $\mathcal{P}^2$. The product operation can also be represented as
the product $c = [a]_\times b$ where $[a]_\times$ is called the "skew-symmetric
matrix of $a$". In tensor form we have $\epsilon_{ijk} a^i b^j = c_k$
representing the cross product of two points (contravariant vectors) resulting
in the line (covariant vector) $c_k$. Similarly, $\epsilon^{ijk} a_i b_j = c^k$
represents the point of intersection of the two lines $a_i$ and $b_j$. The
tensor $\epsilon_{ijk}$ is the antisymmetric tensor defined such that
$\epsilon_{ijk} a^i b^j c^k$ is the determinant of the 3 x 3 matrix whose
columns are the vectors $a, b, c$. As such, $\epsilon_{ijk}$ contains 0, +1, -1
where the vanishing entries correspond to arrangements of indices with
repetitions (21 such entries), whereas the odd permutations of $ijk$ correspond
to -1 entries and the even permutations to +1 entries.
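
As a small illustration of the cross-product tensor, the following sketch builds $\epsilon_{ijk}$ as a NumPy array and checks the cross-product and determinant identities stated above. The vectors are arbitrary test values and the variable names are assumptions made for the example.

```python
import numpy as np

# Antisymmetric tensor: +1 on even permutations of (0,1,2), -1 on odd ones, 0 elsewhere.
eps = np.zeros((3, 3, 3))
eps[0, 1, 2] = eps[1, 2, 0] = eps[2, 0, 1] = 1.0
eps[0, 2, 1] = eps[2, 1, 0] = eps[1, 0, 2] = -1.0

a = np.array([1.0, 2.0, 3.0])
b = np.array([-1.0, 0.5, 2.0])
c = np.array([0.3, -2.0, 1.0])

# eps_ijk a^i b^j = c_k is the cross product a x b.
cross_ab = np.einsum('ijk,i,j->k', eps, a, b)

# eps_ijk a^i b^j c^k is the determinant of the matrix with columns a, b, c.
det_abc = np.einsum('ijk,i,j,k->', eps, a, b, c)

# Number of vanishing entries (index arrangements with repetitions).
n_zero = int(np.count_nonzero(eps == 0.0))
```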

In the sequel we will reserve the indices $i, j, k$ to represent the coordinate
vectors of images 1,2,3 respectively. We will denote points in images 1,2,3 as
$p, p', p''$ respectively, thus in a tensor equation these points will appear
with their corresponding indices, for instance, $p^i p'^j p''^k H_{ijk} = 0$.



2 Homography Tensor 



Consider some plane $\pi$ whose features (points or lines) are projected onto
three views and let $A$ be the collineation from view 1 to view 2, and $B$ the
collineation from view 1 to 3 (we omit the reference to $\pi$ in our notation).
Let $P$ be some point on the plane $\pi$ whose projections are $p, p', p''$ in
views 1,2,3 respectively. Let $q, s, r$ be some lines through $p, p', p''$
respectively. We have $q^\top (A^\top s \times B^\top r) = 0$ because
$A^\top s$ is the projection onto view 1 of the line $L$ on $\pi$, where $L$
projects to $s$ in view 2, and similarly $B^\top r$ is a line in view 1 matching
the line $r$ in view 3. These two lines must intersect at $p$ (see Fig. 1). In
tensor form we have:

$$q_i s_j r_k \left(\epsilon^{inu} a_n^j b_u^k\right) = 0, \tag{1}$$

and we denote the object in parentheses

$$H^{ijk} = \epsilon^{inu} a_n^j b_u^k \tag{2}$$

as the Homography Tensor (in short, Htensor). In the remainder of this section
we will investigate the properties and uses of the Htensor along the following
lines:

- Number of matching points (lines) required for a unique solution for the
  Htensor (Propositions 1, 2).

- Slicing properties of the Htensor and the means for recovering the original
  homography matrices A, B from the Htensor (Theorem 1). The nature of these
  slices provides the source for 11 non-linear constraints among the tensor's
  coefficients.

- Image-to-image mapping using the Htensor.
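
The definition in eqn. 2 and the constraint in eqn. 1 can be sketched numerically as follows. The homographies `A` and `B` are arbitrary invertible matrices chosen for illustration (hypothetical values), and the helper `random_line_through` is an assumption of this example rather than part of the method.

```python
import numpy as np

# Cross-product tensor eps[i, n, u].
eps = np.zeros((3, 3, 3))
eps[0, 1, 2] = eps[1, 2, 0] = eps[2, 0, 1] = 1.0
eps[0, 2, 1] = eps[2, 1, 0] = eps[1, 0, 2] = -1.0

# Hypothetical homographies: A maps view 1 to view 2, B maps view 1 to view 3.
A = np.array([[1.0, 0.2, 0.3], [-0.1, 0.9, -0.2], [0.05, 0.02, 1.0]])
B = np.array([[0.8, -0.3, 0.1], [0.4, 1.1, 0.2], [-0.02, 0.01, 1.0]])

# Eqn. 2: H[i, j, k] = eps[i, n, u] * A[j, n] * B[k, u] (rows carry the upper index).
H = np.einsum('inu,jn,ku->ijk', eps, A, B)

# A point of the plane seen in view 1 and its projections in views 2 and 3.
p = np.array([0.4, -0.7, 1.0])
p1, p2 = A @ p, B @ p

rng = np.random.default_rng(1)

def random_line_through(x):
    """A line through the homogeneous point x: join x with a random second point."""
    return np.cross(x, rng.normal(size=3))

# Eqn. 1: q_i s_j r_k H^{ijk} = 0 for any lines q, s, r through p, p', p''.
residuals = np.array([
    np.einsum('ijk,i,j,k->', H,
              random_line_through(p), random_line_through(p1), random_line_through(p2))
    for _ in range(5)
])
```

The entries of `residuals` vanish up to rounding, even though the three lines in each trial are not matching lines.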




Fig. 1. The lines $s, r$ are mapped by the dual collineations onto view 1,
satisfying the relationship $q^\top (A^\top s \times B^\top r) = 0$.



The first result is that 4 matching triplets provide 26 linearly independent
constraints on the 27 coefficients of $H^{ijk}$; hence, a unique solution (up to
scale) is provided from image measurements of 4 points.

Proposition 1. The n'th matching triplet provides $8 - (n - 1)$ linearly
independent constraints in addition to the constraints provided by the previous
$n - 1$ matching triplets. Hence, 4 matching triplets provide
$8 + 7 + 6 + 5 = 26$ linearly independent constraints.

Proof: The first matching triplet provides 8 linearly independent constraints
because a point is spanned by two lines. Take for example the vertical line
$q_i^1 = (-1, 0, x)$ and the horizontal line $q_i^2 = (0, -1, y)$ passing
through the point $p = (x, y, 1)$, and similarly the horizontal and vertical
lines $s_j^1, s_j^2$ through the point $p'$ and the lines $r_k^1, r_k^2$ through
$p''$, and we have the eight constraints:

$$q_i^\rho s_j^\mu r_k^\nu H^{ijk} = 0, \qquad \rho, \mu, \nu = 1, 2.$$

Consider the second matching triplet $p_2, p'_2, p''_2$ and the constraint
defined by selecting the line $q$ to pass through $p_2$ and $p$, the line $s$ to
pass through $p'_2$ and $p'$, and the line $r$ to pass through $p''_2$ and
$p''$. Clearly, these lines are spanned by the lines through $p, p', p''$, thus
the added constraint is linearly spanned by the eight constraints provided by
$p, p', p''$. Hence, the second matching triplet contributes only 7 additional
linearly independent constraints to the 8 constraints from the first matching
triplet (see Fig. 2). Continuing by induction, the n'th point has lines to the
previous $n - 1$ points, thus $n - 1$ of its 8 possible constraints are already
covered by the previous points. □
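
Proposition 1 can be illustrated with a small synthetic experiment. The sketch below stacks the eight constraints of each matching triplet, reports the rank after each triplet, and recovers the tensor as the null vector of the resulting system. The homographies, the four points and the helper `lines_through` are hypothetical choices for this example, and the data are noise-free.

```python
import numpy as np

eps = np.zeros((3, 3, 3))
eps[0, 1, 2] = eps[1, 2, 0] = eps[2, 0, 1] = 1.0
eps[0, 2, 1] = eps[2, 1, 0] = eps[1, 0, 2] = -1.0

# Hypothetical homographies (view 1 to 2, view 1 to 3) and the resulting Htensor.
A = np.array([[1.0, 0.2, 0.3], [-0.1, 0.9, -0.2], [0.05, 0.02, 1.0]])
B = np.array([[0.8, -0.3, 0.1], [0.4, 1.1, 0.2], [-0.02, 0.01, 1.0]])
H_true = np.einsum('inu,jn,ku->ijk', eps, A, B)

def lines_through(x):
    """Vertical and horizontal lines (-1, 0, x) and (0, -1, y) through a point."""
    u, v = x[0] / x[2], x[1] / x[2]
    return np.array([[-1.0, 0.0, u], [0.0, -1.0, v]])

# Four points of the plane in view-1 coordinates, no three of them collinear.
points = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.3, 0.8, 1.0]])

rows, ranks = [], []
for p in points:
    Q, S, R = lines_through(p), lines_through(A @ p), lines_through(B @ p)
    # Eight constraints q_i s_j r_k H^{ijk} = 0 per triplet, one row each.
    rows.extend(np.einsum('i,j,k->ijk', q, s, r).ravel() for q in Q for s in S for r in R)
    ranks.append(int(np.linalg.matrix_rank(np.array(rows), tol=1e-9)))

# The tensor is the null vector of the stacked system (defined up to scale).
M = np.array(rows)
H_est = np.linalg.svd(M)[2][-1].reshape(3, 3, 3)
cosine = abs(np.sum(H_est * H_true)) / (np.linalg.norm(H_est) * np.linalg.norm(H_true))
```

The list `ranks` grows by 8, 7, 6 and 5, and `cosine` is 1 up to rounding, meaning that the recovered tensor agrees with `H_true` up to scale.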




Fig. 2. A triplet of matching points provides 8 constraints. A second triplet provides
only 7 additional constraints because the constraint defined by the lines connecting the
two sets of points is already covered by the 8 constraints of the first triplet.



Note that the lines $q, s, r$ in eqn. 1 are not matching lines although they pass
through matching points. Our next issue is to show that if $q, s, r$ are matching
lines, then they provide 7 linear constraints, thus 4 matching line triplets provide
28 constraints and a unique solution to $H^{ijk}$.

Proposition 2. If $q, s, r$ are matching lines in views 1,2,3 then
$q_i s_j H^{ijk}$, $q_i r_k H^{ijk}$ and $s_j r_k H^{ijk}$ are null vectors
providing a total of 7 linearly independent constraints on the Htensor. Thus 4
matching lines provide a unique linear solution for $H^{ijk}$.

Proof: If $q, s, r$ are matching lines then the rank of the matrix whose columns
are $[q, A^\top s, B^\top r]$ is 1. Thus, $q \times A^\top s = 0$,
$q \times B^\top r = 0$ and $A^\top s \times B^\top r = 0$. In tensor form,
these translate to the following:

$$q_i s_j e_k H^{ijk} = 0 \quad \forall e_k,$$
$$q_i e_j r_k H^{ijk} = 0 \quad \forall e_j,$$
$$e_i s_j r_k H^{ijk} = 0 \quad \forall e_i.$$

Note that $q_i s_j r_k H^{ijk}$ appears three times (once in every row above),
thus among the nine constraints arising from the fact that $q_i s_j H^{ijk}$,
$q_i r_k H^{ijk}$ and $s_j r_k H^{ijk}$ are null vectors, two of the constraints
are already accounted for, making a total of 7 linearly independent
constraints. □

So far we have shown that 4 matching points or 4 matching lines across
the three views provide a unique solution to the Htensor — just like with 
collineations: 4 is the number of points or lines that is required for a unique 
solution. We turn our attention next to single and double contractions of the 
Htensor — what can be extracted from them and what is their geometric signif- 
icance. 

The double contractions perform mapping operations. For example,
$q_i s_j H^{ijk}$ must be a point (the contravariant index $k$ is left
uncontracted) whose scalar product with the pencil of lines through $p''$
vanishes; hence, the point must be $p''$. We have therefore:

$$q_i s_j H^{ijk} \cong p''^k$$
$$q_i r_k H^{ijk} \cong p'^j$$
$$s_j r_k H^{ijk} \cong p^i$$
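
The three point-transfer relations above can be verified on synthetic data. In this sketch the homographies, the point and the auxiliary points used to form the lines are all hypothetical values chosen for illustration, and `parallel` is a helper introduced only for the example.

```python
import numpy as np

eps = np.zeros((3, 3, 3))
eps[0, 1, 2] = eps[1, 2, 0] = eps[2, 0, 1] = 1.0
eps[0, 2, 1] = eps[2, 1, 0] = eps[1, 0, 2] = -1.0

A = np.array([[1.0, 0.2, 0.3], [-0.1, 0.9, -0.2], [0.05, 0.02, 1.0]])
B = np.array([[0.8, -0.3, 0.1], [0.4, 1.1, 0.2], [-0.02, 0.01, 1.0]])
H = np.einsum('inu,jn,ku->ijk', eps, A, B)       # eqn. 2

p = np.array([0.4, -0.7, 1.0])                   # point in view 1
p1, p2 = A @ p, B @ p                            # matching points in views 2, 3

# Arbitrary (non-matching) lines through p, p', p''.
q = np.cross(p, [1.0, 2.0, 1.0])
s = np.cross(p1, [0.0, 1.0, 1.0])
r = np.cross(p2, [2.0, -1.0, 1.0])

to_view3 = np.einsum('ijk,i,j->k', H, q, s)      # q_i s_j H^{ijk}
to_view2 = np.einsum('ijk,i,k->j', H, q, r)      # q_i r_k H^{ijk}
to_view1 = np.einsum('ijk,j,k->i', H, s, r)      # s_j r_k H^{ijk}

def parallel(a, b):
    """True if two homogeneous vectors agree up to scale."""
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
    return bool(np.allclose(np.cross(a, b), 0.0, atol=1e-9))

checks = (parallel(to_view3, p2), parallel(to_view2, p1), parallel(to_view1, p))
```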

A single contraction is a correlation mapping of lines to points associated
with a Linear Line Complex (LLC), a set of lines that have a common line
intersection called the kernel of the set, which we will derive as follows.
Consider some arbitrary covariant vector $\delta_k$ and the resulting matrix
$\delta_k H^{ijk}$. One can easily verify, by substitution in eqn. 2, that the
resulting matrix is $E = A[\mu]_\times$ where $\mu = B^\top \delta$. What does
the matrix $E$ stand for? Let $L$ be the line on $\pi$ at the intersection of
$\pi$ and the plane defined by the line $\delta$ in view 3 and its center of
projection (see Fig. 3). The projections of $L$ in views 1 and 2 are
$\mu = B^\top \delta$ and $\eta = A^{-\top} \mu$. Clearly, $E\mu = 0$ and
$E^\top \eta = 0$. Consider any other line $S$ intersecting $L$ in space ($S$ is
not necessarily on $\pi$) projecting onto $s, s'$ in views 1,2 respectively.
Then $s'^\top E s = 0$. Taken together, the matrix $E$ maps lines in view 1 onto
collinear points (on the line $\eta$) in view 2. The set of lines $S$ in 3D
whose projections $s, s'$ satisfy $s'^\top E s = 0$ define an LLC whose kernel
is the line $L$ whose projections are the null spaces of $E$ and $E^\top$.
Moreover, $AE^\top$ is a skew-symmetric matrix, thus $AE^\top + EA^\top = 0$
provides 6 linear constraints on the homography matrix $A$.
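
The properties of the single contraction can be checked numerically. The sketch below assumes the matrix arrangement in which the view-2 index $j$ runs over the rows of $E$ and the view-1 index $i$ over its columns; the homographies and the vector `delta` are hypothetical values chosen for illustration.

```python
import numpy as np

eps = np.zeros((3, 3, 3))
eps[0, 1, 2] = eps[1, 2, 0] = eps[2, 0, 1] = 1.0
eps[0, 2, 1] = eps[2, 1, 0] = eps[1, 0, 2] = -1.0

A = np.array([[1.0, 0.2, 0.3], [-0.1, 0.9, -0.2], [0.05, 0.02, 1.0]])
B = np.array([[0.8, -0.3, 0.1], [0.4, 1.1, 0.2], [-0.02, 0.01, 1.0]])
H = np.einsum('inu,jn,ku->ijk', eps, A, B)       # eqn. 2

def skew(v):
    """Skew-symmetric matrix [v]_x such that skew(v) @ w equals np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

delta = np.array([0.5, -1.0, 2.0])               # a line in view 3

# Contract the third index; rows of E are indexed by j, columns by i.
E = np.einsum('ijk,k->ji', H, delta)

mu = B.T @ delta                                 # projection of L in view 1
eta = np.linalg.inv(A).T @ mu                    # projection of L in view 2

matches_closed_form = bool(np.allclose(E, A @ skew(mu)))
null_right = E @ mu                              # E mu = 0
null_left = E.T @ eta                            # E^T eta = 0
skew_residual = A @ E.T + E @ A.T                # A E^T + E A^T = 0
```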

By selecting $\delta$ to range over the standard basis (1, 0, 0), (0, 1, 0),
(0, 0, 1) we obtain three slices of $H^{ijk}$ which we will denote by
$E_1, E_2, E_3$. These slices provide 18 linear constraints for the homography
matrix $A$. Likewise, the three slices $\delta_j H^{ijk}$ provide 18 constraints
on the homography $B$, and the three slices $\delta_i H^{ijk}$ provide 18
constraints on the homography $C = BA^{-1}$ from view 2 to view 3. We summarize
these findings in the following theorem:

Theorem 1. Each of the contractions $\delta_i H^{ijk}$, $\delta_j H^{ijk}$ and
$\delta_k H^{ijk}$ represents a correlation mapping between views (2,3), (1,3)
and (1,2) respectively, associated with the LLC whose kernel is the line at the
intersection of $\pi$ and the plane defined by $\delta$ in views 1,2,3
respectively and the corresponding center of projection. By setting $\delta$ to
be (1, 0, 0), (0, 1, 0) or (0, 0, 1) we obtain three different slicings of the
tensor: denote the slices of $\delta_i H^{ijk}$ by the matrices
$G_1, G_2, G_3$, the slices of $\delta_j H^{ijk}$ by the matrices
$W_1, W_2, W_3$, and the slices of $\delta_k H^{ijk}$ by the matrices
$E_1, E_2, E_3$. Then these slices provide sufficient (and over-determined)
linear constraints for the constituent homography matrices $A$, $B$ and for
$C = BA^{-1}$:

$$C G_i^\top + G_i C^\top = 0, \tag{3}$$
$$B W_i^\top + W_i B^\top = 0, \tag{4}$$
$$A E_i^\top + E_i A^\top = 0, \tag{5}$$

for $i = 1, 2, 3$.

Fig. 3. The contraction $\delta_k H^{ijk}$ is a matrix $E = A[\mu]_\times$ where
$\mu = B^\top \delta$. The covariant vector $\delta$ represents a line in view 3
which together with the center of projection represents a plane whose
intersection with $\pi$ is a line $L$. The projections of $L$ in views 1,2 are
the null spaces of $E$ and $E^\top$ respectively, i.e., $E\mu = 0$ and
$E^\top \eta = 0$. Matching lines $s, s'$ in views 1,2 satisfy
$s'^\top E s = 0$ if and only if the corresponding 3D lines form an LLC whose
kernel is $L$, i.e., a set of lines $S$ that intersect at $L$.

Theorem 1 provides the basis for deriving the "internal consistency"
constraints, which are 11 non-linear functions on the elements of the tensor
that must be satisfied. The details can be found in the full version of this
work in [9].

The slicing breakdown can also be useful for performing a direct image-to-image
mapping, thus bypassing the need to recover the constituent homography matrices
$A, B$. Consider two slices $\delta_k H^{ijk}$ and $\mu_k H^{ijk}$ for some
$\delta, \mu$ and denote the matrices by $E_1, E_2$. Let $p' = s_1 \times s_2$
for some two lines $s_1, s_2$. One can verify that:

$$p \cong (E_1^\top s_1 \times E_2^\top s_1) \times (E_1^\top s_2 \times E_2^\top s_2)$$

Thus, given $p'$ and the tensor $H^{ijk}$ one can determine directly the
matching point $p$.
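
The direct mapping from $p'$ back to $p$ can be sketched as follows. The example assumes the same matrix arrangement as in the closed form $E = A[\mu]_\times$ (view-2 index over rows), takes the two slices along the first two standard basis vectors, and uses hypothetical homographies and points; none of these values come from the paper.

```python
import numpy as np

eps = np.zeros((3, 3, 3))
eps[0, 1, 2] = eps[1, 2, 0] = eps[2, 0, 1] = 1.0
eps[0, 2, 1] = eps[2, 1, 0] = eps[1, 0, 2] = -1.0

A = np.array([[1.0, 0.2, 0.3], [-0.1, 0.9, -0.2], [0.05, 0.02, 1.0]])
B = np.array([[0.8, -0.3, 0.1], [0.4, 1.1, 0.2], [-0.02, 0.01, 1.0]])
H = np.einsum('inu,jn,ku->ijk', eps, A, B)       # eqn. 2

# Two slices of the tensor along the third index (rows: view 2, columns: view 1).
E1 = np.einsum('ijk,k->ji', H, np.array([1.0, 0.0, 0.0]))
E2 = np.einsum('ijk,k->ji', H, np.array([0.0, 1.0, 0.0]))

p = np.array([0.4, -0.7, 1.0])                   # ground-truth point in view 1
p_prime = A @ p                                  # its match in view 2

# Two lines through p' so that p' = s1 x s2 (up to scale).
s1 = np.cross(p_prime, [1.0, 0.0, 1.0])
s2 = np.cross(p_prime, [0.0, 1.0, 1.0])

# Each E_i^T s is a point on the view-1 line matching s; two such points span that line.
line1 = np.cross(E1.T @ s1, E2.T @ s1)
line2 = np.cross(E1.T @ s2, E2.T @ s2)
p_rec = np.cross(line1, line2)                   # intersection of the two lines

cosine = abs(p_rec @ p) / (np.linalg.norm(p_rec) * np.linalg.norm(p))
```

Here `cosine` equals 1 up to rounding, i.e. `p_rec` coincides with `p` as a homogeneous point, without ever forming `A` or `B` from the tensor.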

2.1 Concluding Notes 

In summary, we have introduced the tensor $H^{ijk}$ representing the joint
mapping among three views of a planar surface. The tensor is determined uniquely
by 4 matching points or 4 matching lines but in addition lives in a lower
dimensional manifold, a fact that places internal non-linear constraints on the
27 entries of the tensor. These internal constraints ensure the group property
of collineations arising from a single planar surface. In other words, the
concatenation of collineations to form a joint mapping could be useful in
practice when dealing with a sequence of views of a planar surface. Furthermore,
we have described in detail the tensor contractions (slices) and their geometric
role, notably the role played by the Linear Line Complex configuration.

It is worthwhile noting that the structure of the Htensor bears similarity to
the structure of the quadrifocal tensor $Q^{ijkl}$ which is contracted with 4
lines across 4 views: $q_i s_j r_k t_l Q^{ijkl} = 0$ where $q, s, r, t$ are
coincident with 4 matching points $p, p', p'', p'''$ across the images. The
difference is that $Q^{ijkl}$ applies to a general 3D world whereas the Htensor
applies to a coplanar configuration. Proposition 1, for example, stating that
the number of linear constraints drops gradually as matching points are
introduced, is analogous to the gradual drop in independent constraints for the
quadrifocal tensor. In fact, the Htensor is a contraction, say a slice
$\delta_l Q^{ijkl}$, of the quadrifocal tensor where the plane $\pi$ is
determined by the choice of $\delta$ in the example, and as such one can
represent the quadrifocal tensor as a sum of three outer-products of epipoles
and Htensors. Further details on this issue can be found in a companion paper
in these proceedings [10].

In the next section we will explore the dual form $H_{ijk}$ of the Htensor. The
dual form turns out to be of particular interest as it applies to dynamic point
configurations, a feature which opens up new application frontiers as well.

3 The Dual Homography Tensor $H_{ijk}$

Consider the tensor made up from $H^{ijk}$ but replacing the homographies
$A, B$ (in eqn. 2) by their duals $A^{-\top}$ and $B^{-\top}$. Denote
$A' = A^{-1}$ and $B' = B^{-1}$, then the dual homography tensor is the
covariant tensor described below:

$$H_{ijk} = \epsilon_{inu} a'^{n}_{j} b'^{u}_{k}. \tag{6}$$

Because $H_{ijk}$ is a covariant tensor it applies to 3 points, one in each
view. Consider, therefore, the contraction $p^i p'^j p''^k H_{ijk} = 0$. What
does that entail on the relationship between $p, p', p''$? We have:

$$p^i p'^j p''^k H_{ijk} = p^\top (A'p' \times B'p'') = \det(p, A'p', B'p'') = 0.$$

In other words, $p^i p'^j p''^k H_{ijk} = 0$ when the rank of the 3 x 3 matrix
$[p, A'p', B'p'']$ is either 1 or 2. The rank is 1 iff the points $p, p', p''$
match in the usual sense, i.e., when the three optical rays intersect at a
single point in space. The rank is 2, however, when the three optical rays meet
at a line on $\pi$ because then $p$, $A'p'$ and $B'p''$ are collinear points in
view 1 (note that $A', B'$ are collineations from view 2 to 1 and view 3 to 1,
respectively; see Fig. 4). We thus make the following definition:

Definition 1. A triplet of points $p, p', p''$ is said to be matching with
respect to a static point if the points are matching in the usual sense of the
term, i.e., the corresponding optical rays meet at a single point. The triplet
is said to be matching with respect to a moving point if the three optical rays
meet at a line.
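
Definition 1 and eqn. 6 can be illustrated on synthetic data. In the sketch below the plane is parameterized by view-1 coordinates, so a moving point is simulated by taking three collinear positions and projecting one of them into each view; the homographies and the positions are hypothetical values chosen for the example.

```python
import numpy as np

eps = np.zeros((3, 3, 3))
eps[0, 1, 2] = eps[1, 2, 0] = eps[2, 0, 1] = 1.0
eps[0, 2, 1] = eps[2, 1, 0] = eps[1, 0, 2] = -1.0

A = np.array([[1.0, 0.2, 0.3], [-0.1, 0.9, -0.2], [0.05, 0.02, 1.0]])
B = np.array([[0.8, -0.3, 0.1], [0.4, 1.1, 0.2], [-0.02, 0.01, 1.0]])
A_inv, B_inv = np.linalg.inv(A), np.linalg.inv(B)        # A', B' in the text

# Eqn. 6: H_dual[i, j, k] = eps[i, n, u] * A'[n, j] * B'[u, k].
H_dual = np.einsum('inu,nj,uk->ijk', eps, A_inv, B_inv)

def contract(p, p1, p2):
    """p^i p'^j p''^k H_ijk."""
    return float(np.einsum('ijk,i,j,k->', H_dual, p, p1, p2))

def rank_in_view1(p, p1, p2):
    """Rank of [p, A'p', B'p''], all expressed in view 1."""
    return int(np.linalg.matrix_rank(np.column_stack([p, A_inv @ p1, B_inv @ p2]), tol=1e-9))

# Positions on the plane, in view-1 coordinates.
x1 = np.array([0.0, 0.0, 1.0])
x2 = np.array([1.0, 2.0, 1.0])
x3 = np.array([2.0, 4.0, 1.0])      # collinear with x1, x2
x3_off = np.array([2.0, 5.0, 1.0])  # not collinear with x1, x2

static = (x1, A @ x1, B @ x1)       # same point seen in all three views
moving = (x1, A @ x2, B @ x3)       # point moving along a line between views
generic = (x1, A @ x2, B @ x3_off)  # unrelated positions

values = (contract(*static), contract(*moving), contract(*generic))
ranks = (rank_in_view1(*static), rank_in_view1(*moving))
```

The contraction vanishes for the static and the moving triplet but not for the generic one, and the ranks are 1 and 2 respectively.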

We see from the above that the dual Htensor $H_{ijk}$ applies to both static and
moving points coming from the planar surface $\pi$. The possibility of working
with static and moving elements was introduced recently in [1] where it was
shown that if a moving point along a general (in 3D) straight path is observed
in 5 views, and the camera projection matrices are known, then it is possible to
set up a linear system for estimating the 3D line. With the dual Htensor
$H_{ijk}$, in contrast, we have no knowledge of the camera projection matrices,
but we require that the straight paths the points are taking should all be
coplanar (which makes it possible to work with 3 views instead of 5 and not
require prior information on camera positions). We have from above the following
theorem:

Theorem 2. The tensor $H_{ijk}$ can be uniquely defined from image measurements
associated with moving points only. A triplet of matching points $p, p', p''$
with respect to a moving point on $\pi$ contributes one linear constraint
$p^i p'^j p''^k H_{ijk} = 0$ on the entries of $H_{ijk}$.




Fig. 4. The dual homography tensor and moving points. The collineations $A', B'$ are
from view 2 to 1 and 3 to 1 respectively. If the triplet $p, p', p''$ are projections of
a moving point along a line on $\pi$ then $p, A'p', B'p''$ are collinear in view 1. Thus,
$p^\top (A'p' \times B'p'') = 0$, or $p^i p'^j p''^k H_{ijk} = 0$ where
$H_{ijk} = \epsilon_{inu} a'^{n}_{j} b'^{u}_{k}$.



With 26 matching triplets with respect to moving points on $\pi$ we can obtain
a unique linear solution of the dual tensor. From the principle of duality with
Proposition 1 we can state that there could be at most 8 moving points on
the first line trajectory on $\pi$, at most 7 moving points on the second line,
at most 6 points on the third line and at most 5 points on the fourth line. The
four trajectory lines should be in general position, i.e., no three of them are
concurrent.

If a triplet of points $p, p', p''$ is known to arise from a static point, then
by the principle of duality with Proposition 2 such a triplet provides 7
constraints on $H_{ijk}$, and thus 4 matching triplets that are known to arise
from static points (in general position) provide a unique solution for the dual
tensor.



3.1 Mixed Static and Dynamic Points 

We have so far applied the principle of duality to assert that the 26 matching
triplets with respect to moving points should be arranged along at least 4 line
trajectories and that 4 matching triplets arising from static points are
sufficient for a solution of the dual Htensor. The dual tensor also raises the
possibility of handling a mixed situation where some of the matching triplets
arise from moving points and some from static points, but without any prior
knowledge of what is static and what is dynamic. We call a matching triplet that
carries no label of whether it arises from a static or a dynamic point an
"unlabeled matching triplet". In this section we will address the following
issues:

- In an unlabeled situation what is the maximum number of matching triplets
  arising from static points that are allowed for a unique solution? We will
  show that the number is 10, i.e., that among the 26 triplets at least 16
  should arise from moving points.

- In case $x < 4$ of the matching triplets are labeled as static, how many
  moving points are required for a unique solution? We will show that we need
  $16 - 4x$ triplets arising from moving points.

Theorem 3. In a situation of unlabeled matching triplets arising from a mixture
of static and moving points, let $x < 4$ be the number of labeled matching
triplets that are known a priori to arise from static points. If $x = 0$, then
the matching triplets arising from static points contribute at most 10 linearly
independent constraints, therefore the minimal number of matching triplets
arising from moving points must be 16. In general, the minimal number of
matching triplets arising from moving points is $16 - 4x$ for $x < 4$.

Proof: It is sufficient to prove this theorem for the case where $A = B = I$
(the identity matrix), because all other cases are transformed into this one by
a local change of coordinates.

Consider first the case $x = 0$, i.e., all 26 measurements are of the form
$p^i p'^j p''^k H_{ijk} = 0$ regardless of whether the matching triplet arises
from a static or moving point. We wish to show that the rank of the estimation
matrix in case all the measurements arise from static points is 10. Each row of
the estimation matrix is some "constraint tensor" whose contraction with
$H_{ijk}$ vanishes. In the case $A = B = I$, the constraint tensor is a
symmetric tensor if the matching triplet arises from a static point (because
$p = p' = p''$), i.e., it remains the same under permutation of indices; hence
it contains only 10 different groups of indices

111, 222, 333, 112, 113, 221, 223, 331, 332, 123

up to permutations. Therefore, the rank of the estimation matrix from unlabeled
static points is at most 10, and in order to solve for the tensor we would have
to use at least 16 additional moving points.

Consider the case $x = 1$, i.e., one of the matching triplets is labeled as
static and contributes 9 constraints of rank 7:

$$\begin{array}{lll}
p^i p'^j e_1^k H_{ijk} = 0 & p^i e_1^j p''^k H_{ijk} = 0 & e_1^i p'^j p''^k H_{ijk} = 0 \\
p^i p'^j e_2^k H_{ijk} = 0 & p^i e_2^j p''^k H_{ijk} = 0 & e_2^i p'^j p''^k H_{ijk} = 0 \\
p^i p'^j e_3^k H_{ijk} = 0 & p^i e_3^j p''^k H_{ijk} = 0 & e_3^i p'^j p''^k H_{ijk} = 0
\end{array}$$

where $e_1, e_2, e_3$ are the standard basis (1, 0, 0), (0, 1, 0), (0, 0, 1).
Note that because $A = B = I$, then $p = p' = p''$. Add the three constraints in
the first row:

$$\left(p^i p^j e_1^k + p^i e_1^j p^k + e_1^i p^j p^k\right) H_{ijk} = 0.$$

Then, the constraint tensor in parentheses is a symmetric tensor and thus
spanned by the 10-dimensional subspace of the unlabeled static points. Likewise,
the constraint tensors resulting from adding the constraints of the second and
third rows above are also symmetric. Taken together, 3 out of the 7 constraints
contributed by a labeled static point are already accounted for by the space of
unlabeled static points. Therefore, each labeled static point adds only 4
linearly independent constraints. □
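
The two rank counts used in the proof can be reproduced numerically in the normalized setting $A = B = I$. The sketch below uses randomly drawn static points and one arbitrary labeled point; these values and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# 26 unlabeled static points; with A = B = I each triplet is (p, p, p).
pts = rng.normal(size=(26, 3))
unlabeled = np.array([np.einsum('i,j,k->ijk', p, p, p).ravel() for p in pts])
rank_unlabeled = int(np.linalg.matrix_rank(unlabeled, tol=1e-9))

# One labeled static point contributes 9 constraints, three per basis vector e.
p = np.array([0.3, -0.5, 1.0])
labeled = []
for e in np.eye(3):
    labeled.append(np.einsum('i,j,k->ijk', p, p, e).ravel())
    labeled.append(np.einsum('i,j,k->ijk', p, e, p).ravel())
    labeled.append(np.einsum('i,j,k->ijk', e, p, p).ravel())
labeled = np.array(labeled)
rank_labeled = int(np.linalg.matrix_rank(labeled, tol=1e-9))

# Stacking both sets shows how many new constraints the labeled point adds.
rank_combined = int(np.linalg.matrix_rank(np.vstack([unlabeled, labeled]), tol=1e-9))
added_by_label = rank_combined - rank_unlabeled
```

The unlabeled rows span only the 10-dimensional space of symmetric tensors, the 9 labeled rows have rank 7, and together they reach rank 14, so the labeled point adds 4 constraints.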

3.2 Contractions of the Dual Htensor

A double contraction of the tensor performs a point-point to line mapping. For
example, $p^i p'^j H_{ijk}$ is a line in view 3 which is the projection onto
view 3 of the line on $\pi$ traced by the moving point. In other words, given
any two non-matching points $p, p'$, let the line passing through the two
intersection points between the optical rays and $\pi$ be denoted by $L$. Then,
$p^i p'^j H_{ijk}$ is the projection of $L$ onto view 3, so that any point $p''$
coincident with the projection will form a matching triplet $p, p', p''$
associated with a moving point tracing the line $L$ on $\pi$.

A single contraction is a correlation mapping points to concurrent lines.
Consider, for example, $\delta^k H_{ijk}$ for some contravariant vector (a point
in view 3) $\delta$. One can verify by substitution in eqn. 6 that, up to sign,
the resulting matrix is $E = [\mu]_\times A'$ where $\mu = B'\delta$. Let the
matching points of $\delta$ in views 1,2 be $\mu, \eta$ respectively. Then, by
duality with $H^{ijk}$ we have that $E\eta = 0$ and $E^\top \mu = 0$.
Furthermore, $Ep'$ is a line passing through $\mu$ and $A'p'$ in view 1.
Therefore, $E$ maps the points in view 2 onto concurrent lines that intersect at
a fixed point $\mu$, and likewise, $E^\top$ maps points in view 1 onto
concurrent lines that intersect at a fixed point $\eta$ in view 2. Furthermore,
$p^\top E p' = 0$ for all pairs of $p, p'$ on matching lines through the fixed
points $\mu, \eta$ (see Fig. 5).

Fig. 5. A single contraction, say $\delta^k H_{ijk}$, is a mapping $E$ between views 1,2
from points to concurrent lines. The null spaces of $E$ and $E^\top$ are the matching
points $\eta, \mu$ of $\delta$ in views 2,1. The image points $p'$ are mapped by $E$ to
the lines $A'p' \times \mu$ and the image points $p$ are mapped by $E^\top$ to the lines
$A'^{-1}p \times \eta$ in view 2. The bilinear relation $p^\top E p' = 0$ is satisfied
for all pairs of $p, p'$ on matching lines through the fixed points $\mu, \eta$.



The constituent homography matrices $A', B'$ can be extracted from the slices
of $H_{ijk}$ as follows. Let $E_1, E_2, E_3$ correspond to the three slices of
$\delta^k H_{ijk}$ by letting $\delta$ be (1, 0, 0), (0, 1, 0) and (0, 0, 1)
respectively. Then, $A'^\top E_i + E_i^\top A' = 0$, $i = 1, 2, 3$. Likewise,
$B'$ satisfies such a relation on the slices $\delta^j H_{ijk}$, and the
homography $A'^{-1}B'$ is obtained from the slices $\delta^i H_{ijk}$.

In summary, the dual form of the homography tensor applies to both cases:
optical rays meet at a single point (matching points with respect to a static
point) and optical rays meet at a line on $\pi$ (matching points with respect to
a moving point). In the case where no distinction can be made as to the source
of a matching triplet $p, p', p''$ (static or moving), we have seen that in a
set of at least 26 such matching triplets, 16 of them must arise from moving
points. In case a number $x < 4$ of these triplets are known a priori to arise
from static points, then $16 - 4x$ must arise from moving points. Once the dual
tensor is recovered from image measurements it forms a mapping of both moving
and static points and in particular can be used to distinguish between moving
and static points (a triplet $p, p', p''$ arising from a static point is mapped
to null vectors $p^i p'^j H_{ijk}$, $p^i p''^k H_{ijk}$ and
$p'^j p''^k H_{ijk}$). The dual Htensor can be useful in practice to handle
situations rich in dynamic motion seen from a monocular sequence.
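
The estimation and labeling procedure summarized above can be sketched on noise-free synthetic data. Everything in the example is an assumption made for illustration: the homographies, the four trajectory lines, the number of triplets, the plain SVD solver and the threshold `1e-8` (real measurements would need a robust estimator and a noise-dependent threshold).

```python
import numpy as np

eps = np.zeros((3, 3, 3))
eps[0, 1, 2] = eps[1, 2, 0] = eps[2, 0, 1] = 1.0
eps[0, 2, 1] = eps[2, 1, 0] = eps[1, 0, 2] = -1.0

A = np.array([[1.0, 0.2, 0.3], [-0.1, 0.9, -0.2], [0.05, 0.02, 1.0]])
B = np.array([[0.8, -0.3, 0.1], [0.4, 1.1, 0.2], [-0.02, 0.01, 1.0]])
H_true = np.einsum('inu,nj,uk->ijk', eps, np.linalg.inv(A), np.linalg.inv(B))  # eqn. 6

rng = np.random.default_rng(0)

def h(x, y):
    return np.array([x, y, 1.0])

# Four trajectory lines on the plane (view-1 coordinates), no three concurrent.
lines = [(h(0, 0), h(1, 0)), (h(0, 0), h(0, 1)), (h(1, 0), h(0, 1)), (h(0, -1), h(2, 0))]

triplets, is_static = [], []
for u, v in lines:
    for _ in range(10):
        # Three positions of a point moving along the line, one seen in each view.
        xa, xb, xc = (u + t * (v - u) for t in rng.uniform(-2.0, 2.0, size=3))
        triplets.append((xa, A @ xb, B @ xc))
        is_static.append(False)
for _ in range(8):
    x = h(*rng.uniform(-2.0, 2.0, size=2))
    triplets.append((x, A @ x, B @ x))
    is_static.append(True)

# One linear constraint p^i p'^j p''^k H_ijk = 0 per triplet, labels unused.
M = np.array([np.einsum('i,j,k->ijk', p, p1, p2).ravel() for p, p1, p2 in triplets])
rank = int(np.linalg.matrix_rank(M, tol=1e-8))
H_est = np.linalg.svd(M)[2][-1].reshape(3, 3, 3)
cosine = abs(np.sum(H_est * H_true)) / np.linalg.norm(H_true)

def static_score(H, p, p1, p2):
    """Largest norm among the three double contractions; zero for a static triplet."""
    vs = (np.einsum('ijk,i,j->k', H, p, p1),
          np.einsum('ijk,i,k->j', H, p, p2),
          np.einsum('ijk,j,k->i', H, p1, p2))
    return max(np.linalg.norm(v) for v in vs)

predicted_static = [bool(static_score(H_est, *t) < 1e-8) for t in triplets]
```

In this noise-free setting the stacked system has rank 26, the recovered tensor agrees with `H_true` up to scale, and `predicted_static` reproduces the ground-truth labels that were withheld from the estimation.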

4 Experiments 

We conducted tests on the performance of $H^{ijk}$ compared to pairwise
homography recovery, and tests on $H_{ijk}$ in order to evaluate the performance
on static and moving point configurations. The full details on the experiments
of $H^{ijk}$ can be found in [9]; we briefly summarize the conclusions here. The
Htensor is recovered using standard robust estimators for least-squares
estimation. The non-linear constraints were not taken into account. The
reprojection performance using the recovered Htensor was consistently superior
to the performance using a recovered homography matrix between pairs of views.
On the other hand, we found that recovering the constituent homography matrices
from the Htensor produced significantly poorer results compared to the
reprojection error achieved by the Htensor. Our conclusion is that the recovery
of the homography matrix from $H^{ijk}$ requires numerical conditioning which is
beyond the scope of this work. It is worthwhile to note that recovering the
homography matrix from the skew-symmetric relations with the slices of the
tensor is identical to the way one can recover the fundamental matrix from two
or more homography matrices. It has been shown empirically (R. Szeliski, private
communication) that doing so for the fundamental matrix yields poor results.

In the second experiment, displayed in Fig. 6, we created a scene with mixed
static and moving points. The moving points were part of 4 remote controls
that were in motion while the camera changed position from one view to the
next. The points were tracked along the three views, without knowledge of which
points were static and which were moving. The triplets of matching points were fed
into a least-squares estimation for $H_{ijk}$. We then checked the reprojection error
on the static points (these were at sub-pixel level, as can be seen in Fig. 6f) and
the accuracy of the line trajectories of the moving points. Because the moving
points were clustered on only 4 objects (the remote controls), the accuracy
was measured by “eye-balling” the parallelism of the trajectories of all points
within a moving object. The lines are closely parallel, as can be seen in Fig. 6e.
The dual Htensor can also be used to segment the scene into static and moving
points; this is shown in Fig. 6d.
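The least-squares estimation of the dual tensor described above can be sketched on synthetic data. This is a minimal illustration, not the authors' implementation: it assumes that every triplet satisfies the single trilinear constraint $p^i p'^j p''^k H_{ijk} = 0$, that views 2 and 3 are related to view 1 by plane homographies (called `A` and `B` here, both hypothetical), and that the data are noise-free, so no robust estimator is used. All function names are illustrative.

```python
import numpy as np

def levi_civita():
    """3x3x3 permutation tensor eps_ijk."""
    eps = np.zeros((3, 3, 3))
    for i, j, k in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
        eps[i, j, k] = 1.0
        eps[i, k, j] = -1.0
    return eps

def reference_dual_tensor(A, B):
    """Tensor with p^i p'^j p''^k H_ijk = det[p, A^-1 p', B^-1 p'']."""
    return np.einsum('ibc,bj,ck->ijk', levi_civita(),
                     np.linalg.inv(A), np.linalg.inv(B))

def moving_triplet(A, B, rng):
    """Three images of a planar point that moves along a straight line."""
    q1, q2 = rng.normal(size=3), rng.normal(size=3)
    a, b = rng.normal(size=2)
    q3 = a * q1 + b * q2              # collinear with q1 and q2 on the plane
    return q1, A @ q2, B @ q3

def estimate_dual_tensor(triplets):
    """Null vector (smallest singular vector) of the stacked constraints."""
    rows = np.array([np.kron(np.kron(p, p1), p2) for p, p1, p2 in triplets])
    _, _, vt = np.linalg.svd(rows)
    return vt[-1].reshape(3, 3, 3)    # unit Frobenius norm

def demo(seed=0, n_triplets=40):
    rng = np.random.default_rng(seed)
    A = np.eye(3) + 0.3 * rng.normal(size=(3, 3))   # assumed view 1 -> 2 homography
    B = np.eye(3) + 0.3 * rng.normal(size=(3, 3))   # assumed view 1 -> 3 homography
    H_est = estimate_dual_tensor(
        [moving_triplet(A, B, rng) for _ in range(n_triplets)])
    H_ref = reference_dual_tensor(A, B)
    cosine = abs(np.sum(H_est * H_ref)) / np.linalg.norm(H_ref)
    q = rng.normal(size=3)            # a static point seen in all three views
    static_residual = np.linalg.norm(np.einsum('i,j,ijk->k', q, A @ q, H_est))
    return cosine, static_residual
```

`demo()` returns the cosine between the estimated and reference tensors (both viewed as 27-vectors) and the norm of $p^i p'^j H_{ijk}$ for a static point, which is the quantity that could be thresholded to separate static from moving points.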



5 Summary 

Two views of a 3D scene are sufficient for performing reconstruction, yet there
exist trifocal and quadrifocal tensors that concatenate 3 and 4 views and display
an algebraic added value over 2-view reconstruction. In this paper we have done
something similar for the well known homography matrix: we have shown that
there is an added value in investigating a 3-view analogue of the collineation
operation. The resulting homography tensor $H_i^{jk}$ and its dual $H_{ijk}$ are both
intriguing and of practical value. The homography tensor places stronger
constraints on the mapping across three views than concatenation of pairwise
homography matrices, as evidenced by the coupling associated with a single linear
system, the existence of the non-linear constraints, and the experimental results.
This performance is comparable to the sub-space approach under infinitesimal
motion recently presented in [12].

The dual Htensor, in our mind, shows promising potential for new application
areas and explorations in structure from motion. The possibility of handling
static and dynamic points on equal grounds raises a host of new issues that
this paper only begins to address.

Fig. 6. (a),(b) Two of three views of a planar scene with 4 remotes moving along straight
lines. (c) The first view with the overlaid tracked points. These points were used for
computing the dual homography tensor in a least-squares manner. (d) Segmenting the
static from dynamic points using the recovered dual Htensor. Only the static points are
shown. (e) The trajectory lines are overlaid on the third image; one can see that the
lines of each remote are closely parallel, thus providing an indication of accuracy of the
dual Htensor. (f) Reprojection results using the Htensor as a point transfer mapping.
Note that the static points are aligned with the transferred points whereas the
dynamic points are shifted relative to the transferred points.

The extension of these tensors to higher dimensions is relatively straightforward:
in that case the moving points in the dual tensor move along hyperplanes,
not lines, and the size of the tensor grows exponentially with dimension (and
so does the number of constraints for static points). On the other hand, the
restriction of the moving points to dimension $k < n - 1$ (in particular, $k = 2$
corresponds to motion along a line) is of more practical interest. For example,
the case of $n = 4$, $k = 2$ has been explored in [11] where the application area
is the extension of the classic 3D-to-3D alignment of point clouds to dynamic
situations, such as when the structured light pattern attached to a sensor moves
along with the sensor while the 3D reconstruction takes place.
Abstract. We study the special form that the general multi-image tensor formalism
takes under the plane + parallax decomposition, including matching tensors
and constraints, closure and depth recovery relations, and inter-tensor consistency
constraints. Plane + parallax alignment greatly simplifies the algebra, and uncovers
the underlying geometric content. We relate plane + parallax to the geometry
of translating, calibrated cameras, and introduce a new parallax-factorizing
projective reconstruction method based on this. Initial plane + parallax alignment
reduces the problem to a single rank-one factorization of a matrix of rescaled
parallaxes into a vector of projection centres and a vector of projective heights above
the reference plane. The method extends to 3D lines represented by via-points and
3D planes represented by homographies.

Keywords: Plane + parallax, matching tensors, projective reconstruction,
factorization, structure from motion.



1 Introduction 



This paper studies the special forms that matching tensors take under the plane +
parallax decomposition, and uses this to develop a new projective reconstruction method
based on rank-1 parallax factorization. The main advantage of the plane + parallax
analysis is that it greatly simplifies the usually rather opaque matching tensor algebra,
and clarifies the way in which the tensors encode the underlying 3D camera geometry.
The new plane + parallax factorizing reconstruction method appears to be even stabler
than standard projective factorization, especially for near-planar scenes. It is a one-step,
closed form, multi-point, multi-image factorization for projective structure, and in this
sense improves on existing minimal-configuration and iterative depth recovery plane +
parallax SFM methods II 9I4I23I'22I4.5II . As with standard projective factorization [37],
it can be extended to handle 3D lines (via points) and planes (homographies) alongside
3D points.

Matching tensors 1129181361 are the image signature of the camera geometry. Given 
several perspective images of the same scene taken from different viewpoints, the 3D 
camera geometry is encoded by a set of 3 x 4 homogeneous camera projection matrices. 
These depend on the chosen 3D coordinate system, but the dependence can be elimi- 
nated algebraically to give four series of multi-image tensors (multi-index arrays of 
components), each interconnecting 2-4 images. The different images of a 3D feature are 
constrained by multilinear matching relations with the tensors as coefficients. These
relations can be used to estimate the tensors from an initial set of correspondences, and 
the tensors then constrain the search for further correspondences. The tensors implicitly 
characterize the relative projective camera geometry, so they are a useful starting point 
for 3D reconstruction. Unfortunately, they are highly redundant, obeying a series of 
complicated internal self-consistency constraints whose general form is known but too 
complex to use easily, except in the simplest cases 11361.51171611 . 

On the other hand, a camera is simply a device for recording incoming light in various 
directions at the camera’s optical centre. Any two cameras with the same centre are 
equivalent in the sense that — modulo field-of-view and resolution constraints which 
we ignore for now — they see exactly the same set of incoming light rays. So their 
images can be warped into one another by a 1-1 mapping (for projective cameras, a 2D 
homography). Anything that can be done using one of the images can equally well be 
done using the other, if necessary by pre-warping to make them identical. 
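The equivalence of cameras that share a centre can be checked numerically. The sketch below is illustrative only: it assumes the standard pinhole form $P = K R\,(I \mid -c)$, and the calibrations, rotations, and function names are made-up examples rather than anything taken from the paper. Under that assumption the warp between the two images is $H = K_2 R_2 (K_1 R_1)^{-1}$, independent of the scene.

```python
import numpy as np

def camera(K, R, centre):
    """3x4 pinhole projection matrix P = K R (I | -centre)."""
    return K @ R @ np.hstack([np.eye(3), -centre.reshape(3, 1)])

def rotation_z(angle):
    """Rotation about the optical (z) axis, used only to vary orientation."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def same_centre_homography(K1, R1, K2, R2):
    """Homography mapping image-1 points to image-2 points for a shared centre."""
    return K2 @ R2 @ np.linalg.inv(K1 @ R1)

def max_transfer_error(seed=0, n_points=20):
    rng = np.random.default_rng(seed)
    centre = np.array([0.5, -0.2, 0.1])
    K1 = np.diag([800.0, 800.0, 1.0])
    K2 = np.array([[600.0, 2.0, 30.0], [0.0, 650.0, -10.0], [0.0, 0.0, 1.0]])
    R1, R2 = rotation_z(0.1), rotation_z(-0.3)
    P1, P2 = camera(K1, R1, centre), camera(K2, R2, centre)
    # Random 3D points roughly 10 units in front of the cameras.
    X = np.vstack([rng.normal(size=(3, n_points)) + np.array([[0.0], [0.0], [10.0]]),
                   np.ones((1, n_points))])
    x1, x2 = P1 @ X, P2 @ X
    x2_pred = same_centre_homography(K1, R1, K2, R2) @ x1
    # Compare inhomogeneous pixel coordinates.
    return np.max(np.abs(x2_pred[:2] / x2_pred[2] - x2[:2] / x2[2]))
```

The transfer error is zero up to rounding whatever the 3D points are, which is the sense in which orientation and calibration are 'trivial' image coordinate changes.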

From this point of view, it is clear that the camera centres are the essence of the 3D
camera geometry. Changing the camera orientations or calibrations while leaving the
centres fixed amounts to a ‘trivial’ change of image coordinates, which can be undone at
any time by homographic (un)warping. In particular, the algebraic structure (degeneracy,
number of solutions, etc.) of the matching constraints, tensors and consistency relations
(and a fortiori that of any visual reconstruction based on these) is essentially a 3D
matter, and hence depends only on the camera centres.

It follows that much of the complexity of the matching relations is only apparent. At
bottom, the geometry is simply that of a configuration of 3D points (the camera centres).
But the inclusion of arbitrary calibration-orientation homographies everywhere in the
formulae makes the algebra appear much more complicated than need be. One of the
main motivations for this work was to study the matching tensors and relations in a case
(that of projective plane + parallax alignment) where most of the arbitrariness due
to the homographies has been removed, so that the underlying geometry shows up much
more clearly.

The observation that the camera centres lie at the heart of the projective camera 
geometry is by no means new. It is the basis of Carlsson’s ‘duality’ between 3D points 
and cameras (i.e. centres) 1121431311(111 , and of Heyden & Åström’s closely related ‘reduced
tensor’ approach 1 1 31 1 41 1 .51 1 VI . The growing geometry tradition in the plane + parallax 
literature III S)l23l'2'2l4i4.58l is also particularly relevant here. 



Organization: §2 introduces our plane + parallax representation and shows how it
applies to the basic feature types; §3 displays the matching tensors and constraints in the
plane + parallax representation; §4 discusses tensor scaling, redundancy and consistency;
§5 considers the tensor closure and depth recovery relations under plane + parallax;
§6 introduces the new parallax factorizing projective reconstruction method; §7 shows
some initial experimental results; and §8 concludes.



Notation: Bold italic ‘x’ denotes 3-vectors, bold sans-serif ‘x’ 4-vectors, upper case
‘H, H’ matrices, Greek ‘$\lambda, \mu$’ scalars (e.g. homogeneous scale factors). We use
homogeneous coordinates for 3D points $X$ and image points $x$, but usually inhomogeneous
ones $c$ for projection centres $C = \binom{c}{1}$. We use $P$ for $3 \times 4$ camera
projection matrices, $e$ for epipoles. 3D points $X = \binom{x}{w}$ are parametrized by a
point $x$ on the reference plane and a ‘projective height’ $w$ above it. $\wedge$ denotes
cross-product, $[x]_\times$ the associated $3 \times 3$ skew matrix ($[x]_\times\, y = x \wedge y$),
and $[a, b, c]$ the triple product.

2 The Plane + Parallax Representation 

Data analysis is often simplified by working in terms of small corrections against a
reference model. Image analysis is no exception. In plane + parallax, the reference model
is a real or virtual reference plane whose points are held fixed throughout the image
sequence by image warping. The reference is often a perceptually dominant plane in the
scene such as the ground plane. Points that lie above the plane are not exactly fixed,
but their motion can be expressed as a residual parallax with respect to the plane. The
parallax is often much smaller than the uncorrected image motion, particularly when the
camera motion is mainly rotational. This simplifies feature extraction and matching. For
each projection centre, alignment implicitly defines a unique reference orientation and
calibration, and in this sense entirely cancels any orientation and calibration variations.
Moreover, the residual parallaxes directly encode useful structural information about the
size of the camera translation and the distance of the point above the plane. So alignment
can be viewed as a way of focusing on the essential 3D geometry (the camera centres and
3D points) by eliminating the ‘nuisance variables’ associated with orientation and
calibration. The ‘purity’ of the parallax signal greatly simplifies many geometric
computations. In particular, we will see that it dramatically simplifies the otherwise
rather cumbersome algebra of the matching tensors and relations.

The rest of this section describes our “plane at infinity + parallax” representation. It
is projectively equivalent to the more common “ground plane + parallax” representation
(e.g. [19,42]), but has algebraic advantages (simpler formulae for scale factors, and
the link to translating cameras) that will be discussed below.

Coordinate frame: We suppose given a 3D reference plane with a predefined projective
coordinate system, and a 3D reference point not on the plane. The plane may be
real or virtual, explicit or implicit. The plane coordinates might derive from an image
or be defined by features on the plane. The reference point might be a 3D point, a
projection centre, or arbitrary. We adopt a projective 3D coordinate system that places the
reference point at the 3D origin $(0\ 0\ 0\ 1)^\top$, and the reference plane at infinity in
standard position (i.e. its reference coordinates coincide with the usual coordinates on
the plane at infinity). Examining the possible residual $4 \times 4$ homographies shows that
this fixes the 3D projective frame up to a single global scale factor. If
$H = \begin{pmatrix} A & t \\ b^\top & \lambda \end{pmatrix}$, then the constraint that $H$
fixes each point $\binom{x}{0}$ on the reference plane implies that $A = \mu I$ and $b = 0$,
and the constraint that $H$ fixes the origin $\binom{0}{1}$ implies that $t = 0$. So
$H = \begin{pmatrix} \mu I & 0 \\ 0 & \lambda \end{pmatrix}$, which is a global scaling by
$\mu / \lambda$.

3D points: 3D points are represented as linear combinations of the reference point/origin
and a point on the reference plane:

    $X = \binom{x}{w} = \binom{x}{0} + w \binom{0}{1}$        (1)

$x$ is the intersection with the reference plane of the line through $X$ and the origin. $w$ is
called $X$’s projective height above the plane. $w = 0$ is the reference plane, $w = \infty$ the
origin. $w$ depends on the normalization convention for $x$. If the reference plane is made
finite ($z = 0$) by interchanging $z$ and $w$ coordinates, $w$ becomes the vertical height above
the plane. But in our projective, plane-at-infinity based frame with affine normalization
$x_z = 1$, $w$ is the inverse z-distance (or with spherical normalization $\|x\| = 1$, the inverse
“Euclidean” distance) of $X$ from the origin.

Camera matrices: Plane + parallax aligned cameras fix the image of the reference
plane, so their leading $3 \times 3$ submatrix is the identity. They are parametrized simply by
their projection centres:

    $P = (u\, I_{3 \times 3}\ \ {-c})$ with projection centre $C = \binom{c}{u}$        (2)

Hence, any 3D point can be viewed as a plane + parallax aligned camera and vice versa.
But, whereas points often lie on or near the reference plane ($w \approx 0$), cameras centred
on the plane ($u \approx 0$) are too singular to be useful: they project the entire 3D scene to
their centre point $c$.

We will break the nominal symmetry between points and cameras. Points will be treated
projectively, as general homogeneous 4-component quantities with arbitrary height
component $w$. But camera centres $C = \binom{c}{u}$ will be assumed to lie outside the reference
plane and scaled affinely ($u \to 1$), so that they and their camera matrices $P = (u\, I\ \ {-c})$
are parametrized by their inhomogeneous centre 3-vector $c$ alone.

This asymmetry is critical to our approach. Our coordinate frame and reconstruction 
methods are essentially projective and are most naturally expressed in homogeneous 
coordinates. Conversely, scaling u to 1 freezes the scales of the projection matrices, and 
everywhere that matching tensors are used, it converts formulae that would be bilinear or 
worse in the c’s and u’s, to ones that are merely linear in the c’s. This greatly simplifies the 
tensor estimation process compared to the general unaligned case. The representation 
becomes singular for cameras near the reference plane, but that is not too much of a 
restriction in practice. In any case it had to happen — no minimal linear representation 
can be globally valid, as the general redundant tensor one is. 

Point projection: In image $i$, the image $x_{ip}$ of a 3D point $X_p = \binom{x_p}{w_p}$ is displaced
linearly from its reference image¹ $x_p$ towards the centre of projection $c_i$, in proportion
to its height $w_p$:

    $\lambda_{ip}\, x_{ip} = P_i X_p = (I\ \ {-c_i}) \binom{x_p}{w_p} = x_p - w_p\, c_i$        (3)

Here $\lambda_{ip}$ is a projective depth [32,37], an initially-unknown projective scale factor
that compensates for the loss of the scale information in $P_i X_p$ when $x_{ip}$ is measured in
its image. Although the homogeneous rescaling freedom of $X_p$ makes them individually
arbitrary, the combined projective depths of a 3D point (or more precisely its vector
of rescaled image points $(\lambda_{ip}\, x_{ip})_{i=1 \ldots m}$) implicitly define its 3D structure. This
is similar to the general projective case, except that in plane + parallax the projection
matrix scale freedom is already frozen: while the homogeneous scale factors of $x_p$ and
$x_{ip}$ are arbitrary, $c_i$ has a fixed scale linked to its camera’s position.

¹ The origin/reference point need not coincide with a physical camera, but can still be viewed as
a reference camera $P_0 = (I\ \ 0)$, projecting 3D points $X_p$ to their reference images $x_p$.
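As a small numerical illustration of the aligned projection $\lambda x_i = x - w\, c_i$, the following sketch builds the aligned camera $(I\ \ {-c})$ with the centre scaled to $u = 1$ and projects a point $X = (x, w)$. The function names and the numbers in the usage note are illustrative assumptions, not part of the paper.

```python
import numpy as np

def aligned_camera(c):
    """Plane + parallax aligned camera P = (I | -c), centre scaled to u = 1."""
    return np.hstack([np.eye(3), -np.asarray(c, dtype=float).reshape(3, 1)])

def project(c, x, w):
    """Rescaled image point lambda * x_i = P X = x - w c for X = (x, w)."""
    X = np.append(np.asarray(x, dtype=float), w)
    return aligned_camera(c) @ X
```

For example, with `c = (1, 2, 0.5)` and `x = (0.3, -0.4, 1)`, `project(c, x, 0.0)` returns `x` itself (points on the reference plane are held fixed), while for any other height the result lies on the image line joining `x` and `c`, displaced in proportion to `w`.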

Why not a ground plane? : In many applications, the reference plane is nearby. Pushing 
it out to infinity forces a deeply projective 3D frame. It might seem preferable to use 
a finite reference plane, as in, e.g. iiinEi- For example, interchanging the z and w 
coordinates puts the plane at z = 0, the origin at the vertical infinity (0 0 1 0)^, and 
(modulo Euclidean coordinates on the plane itself) creates an obviously rectilinear 3D 
coordinate system, where x gives the ground coordinates and w the vertical height above 
the plane. However, a finite reference plane would hide a valuable insight that is obvious 
from (2): the plane + parallax aligned camera geometry is projectively equivalent to
translating calibrated cameras. Any algorithm that works for these works for projective 
plane + parallax, and (up to a 3D projectivity !) vice versa. Although not new (see, e.g. 
[031), this analogy deserves to be better known. It provides simple algebra and geometric 
intuition that were very helpful during this work. It explicitly realizes — albeit in a weak, 
projectively distorted sense, with the reference plane mapped to infinity — the suggestion 
that plane + parallax alignment cancels the orientation and calibration, leaving only the 
translation E2i- 

3D Lines: Any 3D line $L$ can be parametrized by a homogeneous 6-tuple of Plücker
coordinates $(l, z)$ where: (i) $l$ is a line 3-vector, $L$’s projection from the origin onto
the reference plane; (ii) $z$ is a point 3-vector, $L$’s intersection with the reference plane;
(iii) $z$ lies on $l$, $l \cdot z = 0$ (this is the Plücker constraint); (iv) the relative scaling of $l$ and
$z$ is fixed and gives $L$’s ‘steepness’: lines on the plane have $z \to 0$, while the ray from
the origin to $z$ has $l \to 0$. This parametrization of $L$ relates to the usual 3D projective
Plücker ($4 \times 4$ skew rank 2 matrix) representations as follows:

    $L^{*} = \begin{pmatrix} [l]_\times & z \\ -z^\top & 0 \end{pmatrix}$ (contravariant form)
    $L_{*} = \begin{pmatrix} [z]_\times & l \\ -l^\top & 0 \end{pmatrix}$ (covariant form)        (4)

The line from $\binom{x}{w}$ to $\binom{y}{v}$ is $(l, z) = (x \wedge y,\ w\, y - v\, x)$. A 3D point
$X = \binom{x}{w}$ lies on $L$ iff $L_{*} X = \binom{z \wedge x + w\, l}{-l \cdot x} = 0$. In a camera at
$c_i$, $L$ projects to:

    $\mu_i\, l_i = l + z \wedge c_i$        (5)

This vanishes if $C_i$ lies on $L$.
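The line parametrization and its projection can be sanity-checked numerically. The sketch below assumes the conventions stated above, namely $(l, z) = (x \wedge y,\ w\, y - v\, x)$ for the line through $(x, w)$ and $(y, v)$, and the projection $l + z \wedge c$; the helper names are hypothetical.

```python
import numpy as np

def line_through(x, w, y, v):
    """Pluecker pair (l, z) of the 3D line through X = (x, w) and Y = (y, v)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.cross(x, y), w * y - v * x

def project_line(l, z, c):
    """Image line (up to scale) of (l, z) in the aligned camera centred at c."""
    return l + np.cross(z, c)
```

Three properties follow directly and are easy to confirm: `l @ z` is zero (the Plücker constraint), `project_line(l, z, c)` equals the cross product of the two projected points `x - w*c` and `y - v*c`, and the projection vanishes when the centre `(c, 1)` is itself a point of the line.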

Displacements and epipoles: Given two cameras with centres $C_i = \binom{c_i}{1}$ and
$C_j = \binom{c_j}{1}$, the 3D displacement vector between their two centres is
$c_{ij} = c_i - c_j$. The scale of $c_{ij}$ is meaningful, encoding the relative 3D camera position.
Forgetting this scale factor gives the epipole $e_{ij}$, the 2D projective point at which the
ray from $C_j$ to $C_i$ crosses the reference plane:

    $e_{ij} \simeq c_{ij} = c_i - c_j$        (6)

We will see below that it is really the inter-camera displacements $c_{ij}$ and not the epipoles
$e_{ij}$ that appear in tensor formulae. Correct relative scalings are essential for geometric
coherence, but precisely because of this they are also straightforward to estimate. Once
found, the displacements $c_{ij}$ amount to a reconstruction of the plane + parallax aligned
camera geometry. To find a corresponding set of camera centres, simply fix the 3D
coordinates of one centre (or function of the centres) arbitrarily, and the rest follow
immediately by adding displacement vectors.

Parallax: Subtracting two point or line projection equations (3, 5) gives the following
important parallax equations:

    $\lambda_i\, x_i - \lambda_j\, x_j = -w\, c_{ij}$
    $\mu_i\, l_i - \mu_j\, l_j = z \wedge c_{ij}$        (7)

Given the correct projective depths $\lambda, \mu$, the relative parallax caused by a camera
displacement is proportional to the displacement vector. The RHS of (7) already suggests
the possibility of factoring a multi-image, multi-point matrix of rescaled parallaxes into
$(w)$ and $(c_{ij})$ matrices. Results equivalent to (7) have appeared previously, albeit with more
complicated scale factors owing to the use of different projective frames.

Equation (7) has a trivial interpretation in terms of 3D displacements. For a point
$X = \binom{x}{w}$ above the reference plane, scaling to $w = 1$ gives projection equations
$\lambda_i\, x_i = P_i X = x - c_i$, so $\lambda_i\, x_i$ is the 3D displacement vector from $C_i$ to $X$. (7)
just says that the sum of displacements around the 3D triangle $C_i\, C_j\, X$ vanishes. On the
reference plane, this entails the alignment of the 2D points $x_i$, $x_j$ and $e_{ij}$ (along the line
of intersection of the 3D plane of $C_i\, C_j\, X$ with the reference plane, see fig. 1), and
hence the vanishing of the triple product $[x_i, e_{ij}, x_j] = 0$. However the 3D information
in the relative scale factors is more explicit in (7).

3D Planes: The 3D plane $p = (n^\top\ d)$ has equation $p\, X = n \cdot x + d\, w = 0$. It
intersects the reference plane in the line $n \cdot x = 0$. The relative scaling of $n$ and $d$ gives
the ‘steepness’ of the 3D plane: $n = 0$ for the reference plane, $d = 0$ for planes through
the origin. A point $x_j$ in image $j$ back-projects to the 3D point $B_j\, x_j$ on $p$, which induces
an image $j$ to image $i$ homography $H_{ij}$, where:

    $B_j = \begin{pmatrix} I - \dfrac{c_j\, n^\top}{n \cdot c_j + d} \\ -\dfrac{n^\top}{n \cdot c_j + d} \end{pmatrix}$
    $H_{ij} = P_i\, B_j = I + \dfrac{c_{ij}\, n^\top}{n \cdot c_j + d}$        (8)

For any $i, j$ and any $p$, this fixes the epipole $e_{ij}$ and each point on the intersection line
$n \cdot x = 0$. $H_{ij}$ is actually a planar homology 13(11 111 : it has a double eigenvalue
corresponding to the points on the fixed line.

Any chosen plane $p = (n^\top\ d)$ can be made the reference plane by applying a 3D
homography $H$ and compensating image homographies $H_i$:

    $H = \begin{pmatrix} I & 0 \\ n^\top / d & 1 \end{pmatrix}$
    $H_i = I - \dfrac{c_i\, n^\top}{n \cdot c_i + d}$        (9)
Reference positions $x$ are unchanged, projective heights are warped by an affinity
$w \to w + n \cdot x / d$, and camera centres are rescaled $c \to \dfrac{d\, c}{n \cdot c + d}$ (infinitely,
if they lie on the plane $p$):

    $X = \binom{x}{w} \ \longrightarrow\ H X = \binom{x}{w + n \cdot x / d}$        (10)

    $p = (n^\top\ d) \ \longrightarrow\ p\, H^{-1} = (0\ \ d)$        (11)

    $P_i = (I\ \ {-c_i}) \ \longrightarrow\ H_i\, P_i\, H^{-1} = \left( I\ \ {-\dfrac{d\, c_i}{n \cdot c_i + d}} \right)$        (12)

When $p$ is the true plane at infinity, the 3D frame becomes affine and the aligned camera
motion becomes truly translational.
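The plane-induced homography and the change of reference plane can both be checked numerically. The sketch below is a reconstruction under stated assumptions rather than the paper's own formulae: it assumes $H_{ij} = I + c_{ij}\, n^\top / (n \cdot c_j + d)$, which follows from the projection equation (3) by back-projecting onto the plane, and the re-referencing pair $H = \begin{pmatrix} I & 0 \\ n^\top/d & 1 \end{pmatrix}$, $H_i = I - c_i\, n^\top / (n \cdot c_i + d)$. All function names are illustrative.

```python
import numpy as np

def aligned_camera(c):
    """Plane + parallax aligned camera P = (I | -c)."""
    return np.hstack([np.eye(3), -np.asarray(c, dtype=float).reshape(3, 1)])

def plane_homography(c_i, c_j, n, d):
    """Assumed form of the image j -> image i homography induced by plane (n, d)."""
    return np.eye(3) + np.outer(c_i - c_j, n) / (n @ c_j + d)

def rereference(n, d, c_i):
    """3D homography H and image homography H_i that make (n, d) the reference plane."""
    H = np.eye(4)
    H[3, :3] = n / d
    H_i = np.eye(3) - np.outer(c_i, n) / (n @ c_i + d)
    return H, H_i

def point_on_plane(x, n, d):
    """Height w such that X = (x, w) satisfies n.x + d w = 0."""
    return -(n @ x) / d
```

With these definitions, a point on the plane projected into image $j$ is mapped by `plane_homography` exactly onto its projection in image $i$, the plane itself is sent to $(0\ \ d)$, and the re-referenced camera is again of aligned form with centre $d\, c_i / (n \cdot c_i + d)$.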

Given multiple planes $p_k$ and images $i$, and choosing some fixed base image 0, the
3 columns of each $H_{i0}(p_k)$ can be viewed as three point vectors and incorporated into
the rank-one factorization method below to reconstruct the $c_{i0}$ and $n_k^\top / (n_k \cdot c_0 + d_k)$.
Consistent normalizations for the different $H_{i0}$ are required. If $c_{i0}$ is known, the correct
normalization can be recovered from $[c_{i0}]_\times H_{i0} = [c_{i0}]_\times$. This amounts to the point
depth recovery equation below applied to the columns of $H_{i0}$ and $H_{00} = I$. Alternatively,
$H_{i0} = I + \ldots$ has two repeated unit eigenvalues, and the right (left) eigenvectors
of the remaining eigenvalue are $c_{i0}$ ($n^\top$). This allows the normalization, epipole and
plane normal to be recovered from an estimated $H_{i0}$. Less compact rank 4 factorization
methods also exist, based on writing $H_{i0}$ as a 9-vector, linear in the components of $I$ and
either $c_{i0}$ or $n_k$ l2S144i4.5ll.

Carlsson duality: Above we gave the plane + parallax correspondence between 3D 
points and (the projection centres of aligned) cameras I10I22I : 



    $X = \binom{x}{w} \ \longleftrightarrow\ P = (w\, I\ \ {-x})$



Carlsson Ql (see also 11311414311014^ ) defined a related but more ‘twisted’ duality 
mapping based on the alignment of a projective basis rather than a plane: 



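    $X = \begin{pmatrix} x \\ y \\ z \\ w \end{pmatrix} \ \longleftrightarrow\ P = \begin{pmatrix} 1/x & & & -1/w \\ & 1/y & & -1/w \\ & & 1/z & -1/w \end{pmatrix} \simeq \begin{pmatrix} 1/x & & \\ & 1/y & \\ & & 1/z \end{pmatrix} (w\, I\ \ {-x})$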



Provided that x, y, z are non-zero, the two mappings differ only by an image homography. 
Plane + parallax aligns a 3D plane pointwise, thus forcing the image —x of the origin to 
depend on the projection centre. Carlsson aligns a 3D projective basis, fixing the image 
of the origin and just 3 points on the plane (and incidentally introducing potentially 
troublesome singularities for projection centres on the x, y and z coordinate planes, as 
well as on the w = 0 one). In either case the point-camera “duality” (isomorphism would 
be a better description) allows some or all points to be treated as cameras and vice versa. 
This has been a fruitful approach for generating new algorithms IP.I43l3lini42i1 917.2141 ■ 
All of the below formulae can be dualized, with the proviso that camera centres should 
avoid the reference plane and be affinely normalized, while points need not and must be 
treated homogeneously. 
3 Matching Tensors and Constraints in Plane + Parallax

Matching Tensors: The matching tensors for aligned projections are very simple
functions of the scaled epipoles / projection centre displacements. From a tensorial point of
view², the simplest way to derive them is to take the homography-epipole decompositions
of the generic matching tensors, and substitute identity matrices for the
homographies:

    $c_{12} = c_1 - c_2$                                            displacement from $C_2$ to $C_1$
    $F_{12} = [c_{12}]_\times$                                            image 1-2 fundamental matrix
    $T_1^{23} = I_1 \otimes c_{13} - c_{12} \otimes I_1$                  image 1-2-3 trifocal tensor
    $Q^{A_1 A_2 A_3 A_4}$                                           image 1-2-3-4 quadrifocal tensor

The plane + parallax fundamental matrix and trifocal tensor have also been studied in
II22E1-. The use of affine scaling $u_i \to 1$ for the centres $\binom{c_i}{u_i}$ is essential here,
otherwise $T$ is bilinear and $Q$ quadrilinear in $c, u$.

Modulo scaling, $c_{12}$ is the epipole $e_{12}$, the intersection of the ray from $C_2$ to $C_1$ with
the reference plane. Coherent relative scaling of the terms of the trifocal and quadrifocal
tensor sums is indispensable here, as in most other multi-term tensor relations. But for
this very reason, the correct scales can be found using these relations. As discussed above,
the correctly scaled $c_{ij}$’s characterize the relative 3D camera geometry very explicitly,
as a network of 3D displacement vectors. It is actually rather misleading to think in
terms of epipolar points on the reference plane: the $c_{ij}$ are neither estimated (e.g. from
the trifocal tensor) nor used (e.g. for reconstruction) like that, and treating their scale
factors as arbitrary only confuses the issue.

Matching constraints: The first few matching relations simplify as follows:

    $[x_1, c_{12}, x_2] = 0$                                                                    epipolar point        (13)
    $(x_1 \wedge x_2)\,(c_{13} \wedge x_3)^\top - (c_{12} \wedge x_2)\,(x_1 \wedge x_3)^\top = 0$        trifocal point        (14)
    $(l_1 \wedge l_2)\,(l_3 \cdot c_{13}) - (l_2 \cdot c_{12})\,(l_1 \wedge l_3) = 0$                    trifocal line         (15)
    $(l_2 \cdot x_1)\,(l_3 \cdot c_{13}) - (l_2 \cdot c_{12})\,(l_3 \cdot x_1) = 0$                      trifocal point-line   (16)
    $(l_2 \wedge l_3)\,(l_1 \cdot c_{14}) + (l_3 \wedge l_1)\,(l_2 \cdot c_{24}) + (l_1 \wedge l_2)\,(l_3 \cdot c_{34}) = 0$        quadrifocal 3-line    (17)

Equation (16) is the primitive trifocal constraint. Given three images $x_i$, $i = 1 \ldots 3$, of a 3D
point $X$, and arbitrary image lines $l_2, l_3$ through $x_2, x_3$, (16) asserts that the 3D optical
ray of $x_1$ meets the 3D optical planes of $l_2, l_3$ in a common 3D point ($X$). The tri- and
quadrifocal 3-line constraints (15, 17) both require that the optical planes of $l_1, l_2, l_3$
intersect in a common 3D line. The quadrifocal 4-point constraint is straightforward but
too long to give here.

² There is no space here to display the general projective tensor analogues of the plane + parallax
expressions given here and below; see 15 . 51 .

Fig. 1. The geometry of the trifocal constraint.
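The simplified matching relations can be verified on synthetic data. The following sketch is an illustration under the conventions used above (aligned cameras $(I\ \ {-c_i})$, displacements $c_{ij} = c_i - c_j$, line images $l + z \wedge c_i$); the random configuration and function names are assumptions made for the example. It returns the residuals of relations (13) to (17), which should all vanish up to rounding.

```python
import numpy as np

def triple(a, b, c):
    """Triple product [a, b, c] = a . (b x c)."""
    return float(np.dot(a, np.cross(b, c)))

def constraint_residuals(seed=0):
    rng = np.random.default_rng(seed)
    cross = np.cross
    c = rng.normal(size=(4, 3))                     # centres c_1..c_4 as rows 0..3
    c12, c13 = c[0] - c[1], c[0] - c[2]
    c14, c24, c34 = c[0] - c[3], c[1] - c[3], c[2] - c[3]
    # A 3D point X = (x, w) and its images, each with an arbitrary scale.
    x, w = rng.normal(size=3), rng.normal()
    x1, x2, x3 = (s * (x - w * ci)
                  for s, ci in zip(rng.uniform(0.5, 2.0, 3), c[:3]))
    # Arbitrary image lines through x2 and x3.
    l2, l3 = cross(x2, rng.normal(size=3)), cross(x3, rng.normal(size=3))
    # A 3D line (l, z) through X and a second point Y = (y, v), seen in views 1..3.
    y, v = rng.normal(size=3), rng.normal()
    l, z = cross(x, y), w * y - v * x
    m1, m2, m3 = (l + cross(z, ci) for ci in c[:3])
    return {
        13: abs(triple(x1, c12, x2)),
        14: np.abs(np.outer(cross(x1, x2), cross(c13, x3))
                   - np.outer(cross(c12, x2), cross(x1, x3))).max(),
        15: np.abs(cross(m1, m2) * (m3 @ c13)
                   - (m2 @ c12) * cross(m1, m3)).max(),
        16: abs((l2 @ x1) * (l3 @ c13) - (l2 @ c12) * (l3 @ x1)),
        17: np.abs(cross(m2, m3) * (m1 @ c14) + cross(m3, m1) * (m2 @ c24)
                   + cross(m1, m2) * (m3 @ c34)).max(),
    }
```

Note that every relation is evaluated with arbitrary homogeneous scales on the image points and lines, while the displacements keep their true relative scales, in line with the remark that coherent scaling of the $c_{ij}$ is what matters.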

The trifocal point constraint contains I20I3.5I22I4I two epipolar constraints in the
form $x \wedge x' \parallel c \wedge x'$, plus a proportionality-of-scale relation for these parallel 3-vectors:

    $(x_1 \wedge x_2) : (c_{12} \wedge x_2) = (x_1 \wedge x_3) : (c_{13} \wedge x_3)$        (18)

The homogeneous scale factors of the $x$’s cancel. This equation essentially says that $x_3$
must progress from $e_{13}$ to $x_1$ in step with $x_2$ as it progresses from $e_{12}$ to $x_1$ (and both
in step with $X$ as it progresses from $C_1$ to $x_1$ on the plane, see fig. 1). In terms of
3D displacement vectors $c$ and $\lambda x$ (or if the figure is projected generically into another
image), the ratio on the LHS of (18) is 1, being the ratio of two different methods of
calculating the area of the triangle $C_1\, C_2\, X$. Similarly for the RHS with $C_1\, C_3\, X$. Both sides
involve $X$, hence the lock-step.

Replacing the lines in the line constraints (15, 16, 17) with corresponding tangents to
iso-intensity contours gives tensor brightness constraints on the normal flow at a point.
The Hanna-Okamoto-Stein-Shashua brightness constraint (16) predominates for small,
mostly-translational image displacements like residual parallaxes. But for more
general displacements, the 3 line constraints give additional information.



4 Redundancy, Scaling, and Consistency 

A major advantage of homography-epipole parametrizations is the extent to which
they eliminate the redundancy that often makes the general tensor representation
rather cumbersome. With plane + parallax against a fixed reference plane, the
redundancy can be entirely eliminated. The aligned $m$ camera geometry has $3m - 4$ d.o.f.:
the positions of the centres modulo an arbitrary choice of origin and a global
scaling. These degrees of freedom are explicitly parametrized by, e.g., the displacements
$c_{i1}$, $i = 2 \ldots m$, again modulo global rescaling. The remaining displacements can be found
from $c_{ij} = c_i - c_j = c_{i1} - c_{j1}$, and all of the matching tensors are simple linear
functions of these. Conversely, the matching constraints are linear in the tensors and
hence in the basic displacements $c_{i1}$, so the complete vector of basic displacements
with the correct relative scaling can be estimated linearly from image correspondences.
These properties clearly simplify reconstruction. They are possible only because plane
+ parallax is a local representation: unlike the general, redundant tensor framework,
it becomes singular whenever a camera approaches the reference plane. However, the
domain of validity is large enough for most real applications.

Fig. 2. The various image projections of each triplet of 3D points and/or camera centres are in
Desargues correspondence.
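The claim that the basic displacements can be estimated linearly, with their correct relative scaling, can be illustrated with the trifocal point relation, which is linear in the pair $(c_{12}, c_{13})$. The sketch below is one possible noise-free formulation and not the paper's estimator: it stacks the nine components of the relation for each point triplet and takes the null vector by SVD. The function names and the synthetic generator are assumptions made for the example.

```python
import numpy as np

def trifocal_residual(c12, c13, x1, x2, x3):
    """3x3 residual of the trifocal point relation for given displacements."""
    return (np.outer(np.cross(x1, x2), np.cross(c13, x3))
            - np.outer(np.cross(c12, x2), np.cross(x1, x3)))

def estimate_displacements(points):
    """Unit 6-vector (c12, c13), defined up to a common scale and sign."""
    blocks = []
    for x1, x2, x3 in points:
        # The residual is linear in (c12, c13): evaluate it on a basis of R^6.
        cols = [trifocal_residual(e[:3], e[3:], x1, x2, x3).ravel()
                for e in np.eye(6)]
        blocks.append(np.stack(cols, axis=1))       # 9 x 6 block per triplet
    _, _, vt = np.linalg.svd(np.vstack(blocks))
    return vt[-1]

def synthetic_points(centres, n_points, rng):
    """Aligned images x - w c_i of random off-plane points, arbitrarily scaled."""
    points = []
    for _ in range(n_points):
        x, w = rng.normal(size=3), rng.uniform(0.5, 2.0)
        points.append(tuple(rng.uniform(0.5, 2.0) * (x - w * ci)
                            for ci in centres))
    return points
```

Because the two displacements are recovered as a single 6-vector, their relative scale comes out of the same linear solve; only the global scale (and sign) remains free, as expected for a projective reconstruction. Points on the reference plane ($w = 0$) contribute nothing, so the sketch samples off-plane points only.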

Consistency relations: As above, if they are parametrized by an independent set of 
inter-centre displacements, individual matching tensors in plane + parallax have no 
remaining internal consistency constraints and can be estimated linearly. The inter- 
tensor consistency constraints reduce to various more or less involved ways of enforcing 
the coincidence of versions of the same inter-camera displacement vector Cij derived 
from different tensors, and the vanishing of cyclic sums of displacements: 

$$c_{ji} \wedge c_{ij} = 0 \qquad c_{ij} \wedge (c_{ij})' = 0 \qquad c_{ij} + c_{jk} + c_{kl} + \ldots + c_{mi} = 0$$

In particular, each cyclic triplet of non-coincident epipoles is not only aligned, but has 
a unique consistent relative scaling $c_{ij} = \lambda_{ij}\, e_{ij}$: 

$$\lambda_{ij}\, e_{ij} + \lambda_{jk}\, e_{jk} + \lambda_{ki}\, e_{ki} = 0$$

This and similar cyclic sums can be used to linearly recover the missing displacement 
scales. However, this fails if the 3D camera centres are aligned: the three epipoles coincide, 
so the vanishing of their cyclic sum still leaves 1 d.o.f. of relative scaling freedom. 
This corresponds to the well-known singularity of many fundamental matrix based 
reconstruction and transfer methods for aligned centres [40]. Trifocal or observation (depth 
recovery) based methods [32,37] must be used to recover the missing scale factors in 
this case. 
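
As a rough illustration of the linear scale recovery described above, the sketch below solves one cyclic triplet for its relative scales. It assumes the three epipoles are given as homogeneous 3-vectors with arbitrary individual scales and that the cyclic relation has the form $\lambda_{ij} e_{ij} + \lambda_{jk} e_{jk} + \lambda_{ki} e_{ki} = 0$. The function name `triplet_scales` and the degeneracy threshold are illustrative choices, not part of the original method.

```python
import numpy as np

def triplet_scales(e_ij, e_jk, e_ki, tol=1e-9):
    """Relative scales (l_ij, l_jk, l_ki), up to a common factor, such that
    l_ij*e_ij + l_jk*e_jk + l_ki*e_ki = 0 (hypothetical helper)."""
    E = np.column_stack([e_ij, e_jk, e_ki])   # 3x3, rank 2 for aligned epipoles
    _, s, vt = np.linalg.svd(E)
    if s[1] < tol * s[0]:
        # All three epipoles coincide (aligned camera centres): 1 d.o.f. remains.
        raise ValueError("coincident epipoles: relative scales are not determined")
    lam = vt[-1]                               # null vector of E
    return lam / lam[np.argmax(np.abs(lam))]   # fix the free overall scale
```

With noisy epipoles the smallest singular value is no longer exactly zero, so the returned vector is only a least-squares estimate of the null direction.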

The cyclic triplet relations essentially encode the coplanarity of triplets of optical 
centres. All three epipoles lie on the line of intersection of this plane with the reference 
plane. Also, the three images of any fourth point or camera centre form a Desargues 
theorem configuration with the three epipoles (see fig. 2). A multi-camera geometry 
induces multiple, intricately interlocking Desargues configurations: the reference plane 
‘signature’ of its coherent 3D geometry. 

5 Depth Recovery and Closure Relations 

Closure relations: In the general projective case, the closure relations are the bilinear 
constraints between the (correctly scaled) matching tensors and the projection matrices, 
that express the fact that the former are functions of the latter nwn . Closure based 
reconstruction [38,40] uses this to recover the projection matrices linearly from the 
matching tensors. In plane + parallax, the closure relations trivialize to identities of the 
form $c_{ij} \wedge (c_i - c_j) = 0$ (since $c_{ij} = c_i - c_j$). Closure based reconstruction just reads 
off a consistent set of $c_i$'s from these linear constraints, with an arbitrary choice of origin 
and global scaling. $c_i = c_{i1}$ is one such solution. 
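
The read-off step can be sketched as a small linear least-squares problem. The sketch assumes the displacements are already correctly scaled, so that $c_{ij} = c_i - c_j$ holds as an equality rather than only up to scale, and it fixes the origin at the first centre. The function name and the 0-based indexing are illustrative.

```python
import numpy as np

def centres_from_displacements(m, pairs, disps):
    """pairs: list of (i, j) index pairs; disps: matching 3-vectors c_ij = c_i - c_j.
    Returns an (m, 3) array of centres with centre 0 placed at the origin."""
    A = np.zeros((len(pairs) + 1, m))
    B = np.zeros((len(pairs) + 1, 3))
    for r, ((i, j), d) in enumerate(zip(pairs, disps)):
        A[r, i], A[r, j] = 1.0, -1.0   # one row per constraint c_i - c_j = c_ij
        B[r] = d
    A[-1, 0] = 1.0                     # gauge row: c_0 = 0 fixes the origin
    C, *_ = np.linalg.lstsq(A, B, rcond=None)
    return C
```

When only the displacements to a single base view are supplied, this reduces to the solution $c_i = c_{i1}$ mentioned above; with redundant pairs it averages them in the least-squares sense.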

Depth recovery relations: Attaching the projection matrices in the closure relations 
to a 3D point gives depth recovery relations linking the matching tensors to correctly 
scaled image points [35,32,40]. These are used, e.g. for projective depth (scale factor) 
recovery in factorization based projective reconstruction [32,37]. For plane + parallax 
registered points and lines with unknown relative scales, the first few depth recovery 
relations reduce to: 

$$c_{ij} \wedge (\lambda_i x_i - \lambda_j x_j) = 0 \qquad \text{epipolar} \qquad (19)$$

$$c_{ij}\,(\lambda_k x_k - \lambda_i x_i)^\top - (\lambda_j x_j - \lambda_i x_i)\,(c_{ik})^\top = 0 \qquad \text{trifocal} \qquad (20)$$

$$(\mu_i l_i - \mu_j l_j) \cdot c_{ij} = 0 \qquad \text{line} \qquad (21)$$

These follow immediately from the parallax equations (7, 8). As before, the trifocal point 
relations contain two epipolar ones, plus an additional relative vector scaling 
proportionality: $(\lambda_i x_i - \lambda_j x_j) : c_{ij} = (\lambda_i x_i - \lambda_k x_k) : c_{ik}$. See fig. 2. 

6 Reconstruction by Parallax Factorization 

Now consider factorization based projective reconstruction under plane + parallax. 
Recall the general projective factorization reconstruction method [32,37]: m cameras with 
3 × 4 camera matrices $\{P_i\}_{i=1 \ldots m}$ view n 3D points $\{X_p\}_{p=1 \ldots n}$ to produce mn image 
points $\lambda_{ip} x_{ip} = P_i X_p$. These projection equations can be gathered into a 3m × n 
matrix: 

$$\begin{pmatrix} \lambda_{11} x_{11} & \cdots & \lambda_{1n} x_{1n} \\ \vdots & \ddots & \vdots \\ \lambda_{m1} x_{m1} & \cdots & \lambda_{mn} x_{mn} \end{pmatrix} = \begin{pmatrix} P_1 \\ \vdots \\ P_m \end{pmatrix} \begin{pmatrix} X_1 & \cdots & X_n \end{pmatrix} \qquad (22)$$

So the $(\lambda x)$ matrix factorizes into rank 4 factors. Any such factorization amounts to 
a projective reconstruction: the freedom is exactly a 4 × 4 projective change of coordinates 
H, with $X_p \to H X_p$ and $P_i \to P_i H^{-1}$. With noisy data the factorization is 
not exact, but we can use a numerical method such as truncated SVD to combine the 
measurements and estimate an approximate factorization and structure. To implement 
this with image measurements, we need to recover the unknown projective depths (scale 
factors) $\lambda_{ip}$. For this we use matching tensor based depth recovery relations such as 
$F_{ij}\,(\lambda_{jp} x_{jp}) = e_{ji} \wedge (\lambda_{ip} x_{ip})$ [35,32,37]. Rescaling the image points amounts to an 
implicit projective reconstruction, which the factorization consolidates and concretizes. 
For other factorization based SFM methods, see (among others) mmmsm . 
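
The rank 4 factorization step can be sketched with a truncated SVD. The sketch assumes the projective depths have already been recovered, so that the input is the 3m × n matrix of rescaled points $\lambda_{ip} x_{ip}$. The function name is illustrative, and the way the singular values are split between the two factors is an arbitrary choice within the projective freedom H.

```python
import numpy as np

def projective_factorization(W):
    """W: (3m, n) matrix of rescaled image points lambda_ip * x_ip.
    Returns P (3m, 4) and X (4, n) with P @ X the best rank 4 approximation of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    P = U[:, :4] * s[:4]   # stacked 3x4 camera matrices, up to a common 4x4 H
    X = Vt[:4]             # homogeneous 3D points, up to the inverse of the same H
    return P, X
```

For any invertible 4 × 4 matrix H, `P @ np.linalg.inv(H)` and `H @ X` give an equally valid reconstruction.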
Plane + parallax point factorization: The general rank 4 method continues to work 
under plane + parallax with aligned points Xip , but in this case a more efficient rank 1 
method exists, that exploits the special form of the aligned projection matrices : 



1. Align the mn image points to the reference plane and (as for the general-case 
factorization) estimate their scale factors $\lambda_{ip}$ by chaining together a network of 
plane + parallax depth recovery relations ((19) or (20)). 

2. Choose a set of arbitrary weights $\rho_i$ with $\sum_i \rho_i = 1$. We will work in a 3D frame 
based at the weighted average of the projection centres: i.e. $\bar{c} = \sum_i \rho_i c_i$ will 
be set to 0. For the experiments we work in an average-of-centres frame $\rho_i = 1/m$. 
Alternatively, we could choose some image j as a base image, $\rho_i = \delta_{ij}$. 

3. Calculate the weighted mean of the rescaled images of each 3D point, and their 
residual parallaxes relative to this in each image. The theoretical values are given 
for reference, based on the parallax equations and our choice of frame $\bar{c} \to 0$: 

$$\bar{x}_p \equiv \sum_i \rho_i\,(\lambda_{ip} x_{ip}) \approx x_p - w_p\, \bar{c} \to x_p \qquad (23)$$

$$\delta x_{ip} \equiv \lambda_{ip} x_{ip} - \bar{x}_p \approx -(c_i - \bar{c})\, w_p \to -c_i\, w_p \qquad (24)$$

4. Factorize the combined residual parallax matrix to rank 1, to give the projection 
centres $c_i$ and point depths $w_p$, with their correct relative scales: 

$$\begin{pmatrix} \delta x_{11} & \cdots & \delta x_{1n} \\ \vdots & \ddots & \vdots \\ \delta x_{m1} & \cdots & \delta x_{mn} \end{pmatrix} \approx -\begin{pmatrix} c_1 \\ \vdots \\ c_m \end{pmatrix} \begin{pmatrix} w_1 & \cdots & w_n \end{pmatrix} \qquad (25)$$

The ambiguity in the factorization is a single global scaling $c_i \to \mu\, c_i$, $w_p \to w_p / \mu$ 
(the length scale of the scene). 

5. The final reconstructions are $P_i = (I \mid -c_i)$ and $X_p = \begin{pmatrix} \bar{x}_p \\ w_p \end{pmatrix}$. 



This process requires the initial plane + parallax alignment, and estimates of the epipoles 
for projective depth recovery. It returns the 3D structure and camera centres in a projective 
frame that places the reference plane at infinity and the origin at the weighted average 
of camera centres. 
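
Steps 2 to 4 above can be sketched as follows. The sketch assumes the image points are already aligned to the reference plane and correctly scaled (step 1), and that the parallax model has the form $\lambda_{ip} x_{ip} = x_p - w_p c_i$. The array layout, the function name and the sign convention used to split the rank 1 factors are illustrative assumptions.

```python
import numpy as np

def parallax_factorization(Y, rho=None):
    """Y: (m, n, 3) array of aligned, rescaled image points lambda_ip * x_ip.
    rho: optional weights summing to 1 (default: average-of-centres, 1/m).
    Returns centres c (m, 3), heights w (n,) and mean points xbar (n, 3)."""
    m, n, _ = Y.shape
    rho = np.full(m, 1.0 / m) if rho is None else np.asarray(rho, dtype=float)
    xbar = np.einsum("i,ipk->pk", rho, Y)          # weighted mean per 3D point
    dX = Y - xbar[None]                            # residual parallaxes, (m, n, 3)
    D = dX.transpose(0, 2, 1).reshape(3 * m, n)    # stack into a (3m, n) matrix
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    c = -(U[:, 0] * s[0]).reshape(m, 3)            # model: dX_ip = -c_i * w_p
    w = Vt[0]
    return c, w, xbar
```

The returned `c` and `w` are defined only up to the global scaling $c_i \to \mu c_i$, $w_p \to w_p/\mu$ noted in step 4, and the centres are expressed relative to their weighted average.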

With affine coordinates on the reference plane, the heights $w_p$ reduce to inverse 
depths $1/Z_p$ (w.r.t. the projectively distorted frame). Several existing factorization based 
SFM methods try to cancel the camera rotation and then factor the resulting translational 
motion into something like (inverse depth)·(translation), e.g. rr2E^ii . Owing to 
perspective effects, this is usually only achieved approximately, which leads to an iterative 
method. Here we require additional knowledge (a known, alignable reference plane 
and known epipoles for depth recovery) and we recover only projective structure, but 
this allows us to achieve exact results from perspective images with a single non-iterative 
rank 1 factorization. It would be interesting to investigate the relationships between our 
method and these methods, but we have not yet done so. 

Line Factorization: As in the general projective case, lines can be integrated into the 
point factorization method using via points. Each line is parametrized by choosing two 
arbitrary (but well-spaced) points on it in one image. The corresponding points on other 
images of the line are found by epipolar or trifocal point transfer, and the 3D via points 
are reconstructed using factorization. It turns out that the transfer process automatically 
gives the correct scale factor (depth) for the via points : 

_ ^ A {F ij Xj ) general _ . plane -H parallax 

- li-eji case ~ lye^j case 

Under plane + parallax, all images $z \wedge c_i + l$ of a line $(l, z)$ intersect in a common point 
z. If we estimate this first, only one additional via point is needed for the line. 

Plane factorization : As mentioned in inter-image homographies Hio induced by 3D 

planes against a fixed base image 0 can also be incorporated in the above factorization, 
simply by treating their three columns as three separate point 3-vectors. Under plane 
+ parallax, once they are scaled correctly as in m the homographies take the form 
©. Averaging over i as above gives an Hio of the same form, with Cio replaced by 
c — Cq — Cq. So the corresponding “homography parallaxes” SHig — factor 

as for points, with ^ in place of Wp. Alternatively, if Cq is taken as origin and the 
S’s are measured against image 0, 1 rather than is subtracted. 

Optimality properties: Ideally, we would like our structure and motion estimates to 
be optimal in some sense. For point estimators like maximum likelihood or MAP, this 
amounts to globally minimizing a measure of the (robustified, covariance-weighted) 
total squared image error, perhaps with overfitting penalties, etc. Unfortunately, as 
with all general closed-form projective SFM methods that we are aware of, and 
notwithstanding its excellent performance in practice, plane + parallax factorization uses an 
algebraically simple but statistically suboptimal error model. Little can be done about 
this, beyond using the method to initialize an iterative nonlinear refinement procedure 
(e.g. bundle adjustment). As in other estimation problems, it is safest to refine the results 
after each stage of the process, to ensure that the input to the next stage is as accurate 
and as outlier-free as possible. But even if the aligning homographies are refined in this 
way before being used (c.f. IlhHH 11), the projective centering and factorization steps 
are usually suboptimal because the projective rescaling $\lambda_{ip} \neq 1$ skews the statistical 
weighting of the input points. In more detail, by pre-weighting the image data matrix 
before factorization, affine factorization 01 can be generalized to give optimal results 
under an image error model as general as a per-image covariance times a per-3D-point 
weight¹. But this is no longer optimal in projective factorization: even if the input errors 
are uniform, rescaling by the non-constant factors $\lambda_{ip}$ distorts the underlying error 
model. In the plane + parallax case, the image rectification step further distorts the error 
model whenever there is non-negligible camera rotation. In spite of this, our experiments 
suggest that plane + parallax factorization gives near-optimal results in practice. 

¹ I.e. image point $x_{ip}$ has covariance $\rho_p C_i$, where $C_i$ is a fixed covariance matrix for image i and 
$\rho_p$ a fixed weight for 3D point p. Under this error model, factoring the weighted data matrix 
$(\rho_p^{-1/2}\, C_i^{-1/2}\, x_{ip})$ into weighted camera matrices $C_i^{-1/2} P_i$ and 3D point vectors $\rho_p^{-1/2} X_p$ 
gives statistically optimal results. Side note: For typical images at least 90-95% of the image 
energy is in edge-like rather than corner-like structures (“the aperture problem”). So assuming 
that the (residual) camera rotations are small, an error model that permitted each 3D point to 
have its own highly anisotropic covariance matrix would usually be more appropriate than a 
per-image covariance. Irani & Anandan [20] go some way towards this by introducing an initial 
reduction based on a higher rank factorization of transposed weighted point vectors. 

[Figure 3 plot: Reconstruction Error vs. Image Noise] 
Fig. 3. A comparison of 3D reconstruction errors for plane + parallax SFM factorization, 
fundamental matrix based projective factorization [32,37], and projective bundle adjustment. 
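
The pre-weighting idea for the affine case can be sketched as below, under the stated error model in which image point $x_{ip}$ has covariance $\rho_p C_i$. The sketch whitens rows with $C_i^{-1/2}$ and columns with $\rho_p^{-1/2}$, factors by truncated SVD, then undoes the weights. The function names, the 2m × n data layout and the use of a symmetric inverse square root are assumptions made for illustration.

```python
import numpy as np

def inv_sqrt_spd(C):
    """Symmetric inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def weighted_affine_factorization(W, covs, rho, rank=3):
    """W: (2m, n) centred image coordinates, two rows (x, y) per image.
    covs: list of m SPD 2x2 matrices C_i; rho: n positive point weights."""
    m = len(covs)
    L = np.zeros((2 * m, 2 * m))       # block diagonal of C_i^(-1/2)
    Linv = np.zeros_like(L)            # block diagonal of C_i^(+1/2)
    for i, C in enumerate(covs):
        Ci = inv_sqrt_spd(C)
        L[2 * i:2 * i + 2, 2 * i:2 * i + 2] = Ci
        Linv[2 * i:2 * i + 2, 2 * i:2 * i + 2] = np.linalg.inv(Ci)
    r = np.asarray(rho, dtype=float) ** -0.5
    Ww = (L @ W) * r                   # whitened data: unit isotropic noise
    U, s, Vt = np.linalg.svd(Ww, full_matrices=False)
    P = Linv @ (U[:, :rank] * s[:rank])  # undo the per-image weighting
    X = Vt[:rank] / r                    # undo the per-point weighting
    return P, X
```

As the text notes, this construction does not carry over to projective factorization, where the rescaling by $\lambda_{ip}$ changes the error model.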

7 Experiments 

Figure 3 compares the performance of the plane + parallax point factorization method 
described above, with conventional projective factorization using fundamental matrix 
depth recovery [32,37], and also with projective bundle adjustment initialized from the 
plane + parallax solution. Cameras about 5 radii from the centre look inwards at a 
synthetic spherical point cloud cut by a reference plane. Half the points (but at least 
4) lie on the plane, the rest are uniformly distributed in the sphere. The image size is 
512 × 512, the focal length 1000 pixels. The cameras are uniformly spaced around a 
90° arc centred on the origin. The default number of views is 4, points 20, Gaussian 
image noise 1 pixel. In the scene flatness experiment, the point cloud is progressively 
flattened onto the plane. The geometry is strong except under strong flattening and for 
small numbers of points. 

The main conclusions are that plane + parallax factorization is somewhat more 
accurate than the standard fundamental matrix method, particularly for near planar scenes 
and linear fundamental matrix estimates, and often not far from optimal. In principle this 
was to be expected given that plane + parallax applies additional scene constraints (known 
coplanarity of some of the observed points). However, additional processing steps are 
involved (plane alignment, point centring), so it was not clear a priori how effectively 
the coplanarity constraints could be used. In fact, the two factorizations have very similar 
average reprojection errors in all the experiments reported here, which suggests that the 
additional processing introduces very little bias. The plane + parallax method’s greater 
stability is confirmed by the fact that its factorization matrix is consistently a little better 
conditioned than that of the fundamental matrix method (i.e. the ratio of the smallest 
structure to the largest noise singular value is larger). 

8 Summary 

Plane + parallax alignment greatly simplifies multi-image projective geometry, reducing 
matching tensors and constraints, closure, depth recovery and inter-tensor consistency 
relations to fairly simple functions of the (correctly scaled!) epipoles. Choosing projec- 
tive plane + parallax coordinates with the reference plane at infinity helps this process 
by providing a (weak, projective) sense in which reference plane alignment cancels out 
precisely the camera rotation and calibration changes. This suggests a fruitful analogy 
with the case of translating calibrated cameras and a simple interpretation of plane + 
parallax geometry in terms of 3D displacement vectors. 

The simplified parallax formula allows exact projective reconstruction by a simple 
rank-one (centre of projection)·(height) factorization. Like the general projective 
factorization method [32,37], an initial scale recovery step based on estimated epipoles 
is needed. When the required reference plane is available, the new method appears to 
perform at least as well as the general method, and significantly better in the case of near- 
planar scenes. Lines and homography matrices can be integrated into the point-based 
method, as in the general case. 

Future work: We are still testing the plane + parallax factorization and refinements are 
possible. It would be interesting to relate it theoretically to affine factorization fSj, and 
also to Oliensis’s family of bias-corrected rotation-cancelling multiframe factorization 
methods [25,26]. Bias correction might be useful here too, although our centred data is 
probably less biased than the key frames of [25,26]. 

The analogy with translating cameras is open for exploration, and more generally, the 
idea of using a projective choice of 3D and image frames to get closer to a situation with 
a simple, special-case calibrated method, thus giving a simplified projective one. E.g. we 
find that suitable projective rectification of the images often makes affine factorization 
m much more accurate as a projective reconstruction method. 

One can also consider autocalibration in the plane + parallax framework. It is easy 
to derive analogues of ED (if only structure on the reference plane is used), or ltT6l3^ 
(if the off-plane parallaxes are used as well). But so far this has not led to any valuable 
simplifications or insights. Reference plane alignment distorts the camera calibrations, 
so the aligning homographies cannot (immediately) be eliminated from the problem. 



Plane + Parallax, Tensors and Factorization 






References 

[1] D. Capel and A. Zisserman. Automated mosaicing with super-resolution zoom. In 
Int. Conf. Computer Vision & Pattern Recognition, pages 885-891, June 1998. 

[2] S. Carlsson. Duality of reconstruction and positioning from projective views. In P. Anandan, 
editor, IEEE Workshop on Representation of Visual Scenes. IEEE Press, 1995. 

[3] S. Carlsson and D. Weinshall. Dual computation of projective shape and camera positions 
from multiple images. Int.J. Computer Vision, 27(3):227-241, May 1998. 

[4] A. Criminisi, I. Reid, and A. Zisserman. Duality, rigidity and planar parallax. In European 
Conf. Computer Vision, pages 846-861. Springer- Verlag, 1998. 

[5] O. Faugeras and B. Mourrain. On the geometry and algebra of the point and line correspon- 
dences between n images. In Int. Conf. Computer Vision, pages 951-6, 1995. 

[6] O. Faugeras and T. Papadopoulo. Grassmann-Cayley algebra for modeling systems of ca- 
meras and the algebraic equations of the manifold of trifocal tensors. Transactions of the 
Royal society A, 1998. 

[7] K. Hanna and N. Okamoto. Combining stereo and motion analysis for direct estimation of 
scene structure. In Int. Conf. Computer Vision & Pattern Recognition, pages 357-65, 1993. 

[8] R. Hartley. Lines and points in three views and the trifocal tensor. Int.J. Computer Vision, 
22(2): 125-140, 1997. 

[9] R. Hartley. Self calibration of stationary cameras. Int.J. Computer Vision, 22(l):5-23, 1997. 

[10] R. Hartley and G. Debunne. Dualizing scene reconstruction algorithms. In R. Koch and L. Van 
Gool, editors, Workshop on 3D Structure from Multiple Images of Large-scale Environments 
SMILE’98, pages 14-31. Springer-Verlag, 1998. 

[11] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge 
University Press, 2000. 

[12] D. Heeger and A. Jepson. Subspace methods for recovering rigid motion I: Algorithm and 
implementation. Int.J. Computer Vision, 7:95-117, 1992. 

[13] A. Heyden. Reconstruction from image sequences by means of relative depths. In E. Grimson, 
editor, Int. Conf. Computer Vision, pages 1058-63, Cambridge, MA, June 1995. 

[14] A. Heyden and K. Astrom. A canonical framework for sequences of images. In IEEE 
Workshop on Representations of Visual Scenes, Cambridge, MA, June 1995. 

[15] A. Heyden and K. Astrom. Algebraic varieties in multiple view geometry. In European 
Conf. Computer Vision, pages 671-682. Springer- Verlag, 1996. 

[16] A. Heyden and K. Astrom. Euclidean reconstruction from constant intrinsic parameters. In 
Int. Conf. Pattern Recognition, pages 339-43, Vienna, 1996. 

[17] A. Heyden and K. Astrom. Algebraic properties of multilinear constraints. Mathematical 
Methods in the Applied Sciences, 20: 1135-1162, 1997. 

[18] A. Heyden, R. Berthilsson, and G. Sparr. An iterative factorization method for projective 
structure and motion from image sequences. Image & Vision Computing, 17(5), 1999. 

[19] M. Irani and P. Anandan. Parallax geometry of pairs of points for 3d scene analysis. In 
European Conf. Computer Vision, pages 17-30. Springer-Verlag, 1996. 

[20] M. Irani and P. Anandan. Factorization with uncertainty. In European Conf. Computer Vision. 
Springer-Verlag, 2000. 

[21] M. Irani, P. Anandan, and M. Cohen. Direct recovery of planar-parallax from multiple frames. 
In Vision Algorithms: Theory and Practice. Springer-Verlag, 1999. 

[22] M. Irani, P. Anandan, and D. Weinshall. From reference frames to reference planes: Multi-view 
parallax geometry and applications. In European Conf. Computer Vision, pages 829-845. 
Springer-Verlag, 1998. 

[23] M. Irani and P. Anandan. A unified approach to moving object detection in 2d and 3d scenes. 
IEEE Trans. Pattern Analysis & Machine Intelligence, 20(6):577-589, June 1998. 







B. Triggs 



[24] R. Kumar, P. Anandan, M. Irani, J. Bergen, and K. Hanna. Representation of scenes from 
collections of images. In IEEE Workshop on Representations of Visual Scenes, pages 10-17, 
June 1995. 

[25] J. Oliensis. Multiframe structure from motion in perspective. In IEEE Workshop on Repre- 
sentation of Visual Scenes, pages 77-84, June 1995. 

[26] J. Oliensis and Y. Genc. Fast algorithms for projective multi-frame structure from motion. 
In Int. Conf. Computer Vision, pages 536-542, Corfu, Greece, 1999. 

[27] C.J. Poelman and T. Kanade. A paraperspective factorization method for shape and motion 
recovery. In European Conf. Computer Vision, pages 97-108, Stockholm, 1994. Springer- 
Verlag. 

[28] A. Shashua and S. Avidan. The rank 4 constraint in multiple (> 3) view geometry. In 
European Conf. Computer Vision, pages 196-206, Cambridge, 1996. 

[29] A. Shashua and M. Werman. On the trilinear tensor of three perspective views and its 
underlying geometry. In Int. Conf. Computer Vision, Boston, MA, June 1995. 

[30] C. E. Springer. Geometry and Analysis of Projective Spaces. Freeman, 1964. 

[31] G. Stein and A. Shashua. Model-based brightness constraints: On direct estimation of 
structure and motion. In Int. Conf. Computer Vision & Pattern Recognition, pages 400-406, 
1997. 

[32] P. Sturm and B. Triggs. A factorization based algorithm for multi-image projective structure 
and motion. In European Conf. Computer Vision, pages 709-20, Cambridge, U.K., 1996. 
Springer- Verlag . 

[33] R. Szeliski and S.-B. Kang. Direct methods for visual scene reconstruction. In IEEE Workshop 
on Representation of Visual Scenes, pages 26-33, June 1995. 

[34] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a 
factorization method. Int.J. Computer Vision, 9(2): 137-54, 1992. 

[35] B. Triggs. The geometry of projective reconstruction I: Matching constraints and the joint 
image. Submitted to Int.J. Computer Vision. 

[36] B. Triggs. Matching constraints and the joint image. In E. Grimson, editor, Int. Conf. Com- 
puter Vision, pages 338-43, Cambridge, MA, June 1995. 

[37] B. Triggs. Factorization methods for projective structure and motion. In Int. Conf. Computer 
Vision & Pattern Recognition, pages 845-51, San Francisco, CA, 1996. 

[38] B. Triggs. Linear projective reconstruction from matching tensors. In British Machine Vision 
Conference, pages 665-74, Edinburgh, September 1996. 

[39] B. Triggs. Autocalibration and the absolute quadric. In Int. Conf. Computer Vision & Pattern 
Recognition, Puerto Rico, 1997. 

[40] B. Triggs. Linear projective reconstruction from matching tensors. Image & Vision Compu- 
ting, 15(8):617-26, August 1997. 

[41] B. Triggs. Autocalibration from planar scenes. In European Conf. Computer Vision, pages 
I 89-105, Freiburg, June 1998. 

[42] D. Weinshall, P. Anandan, and M. Irani. From ordinal to euclidean reconstruction with partial 
scene calibration. In R. Koch and L. Van Gool, editors, 3D Structure from Multiple Images 
of Large-scale Environments SMILE’98, pages 208-223. Springer- Verlag, 1998. 

[43] D. Weinshall, M. Werman, and A. Shashua. Shape tensors for efficient and learnable indexing. 
In IEEE Workshop on Representation of Visual Scenes, pages 58-65, June 1995. 

[44] L. Zelnik-Manor and M. Irani. Multi-frame alignment of planes. In Int. Conf. Computer 
Vision & Pattern Recognition, pages 151-156, 1999. 

[45] L. Zelnik-Manor and M. Irani. Multi-view subspace constraints on homographies. In 
Int. Conf. Computer Vision, pages 710-715, 1999. 




Factorization with Uncertainty 



Michal Irani¹ and P. Anandan² 

¹ Department of Computer Science and Applied Mathematics 
The Weizmann Institute of Science, Rehovot 76100, Israel 

² Microsoft Corporation, One Microsoft Way, Redmond, WA 98052, USA 



Abstract. Factorization using Singular Value Decomposition (SVD) is 
often used for recovering 3D shape and motion from feature correspon- 
dences across multiple views. SVD is powerful at finding the global solu- 
tion to the associated least-square-error minimization problem. However, 
this is the correct error to minimize only when the x and y positional 
errors in the features are uncorrelated and identically distributed. But 
this is rarely the case in real data. Uncertainty in feature position depends 
on the underlying spatial intensity structure in the image, which 
has strong directionality to it. Hence, the proper measure to minimize is 
covariance-weighted squared-error (or the Mahalanobis distance). In this 
paper, we describe a new approach to covariance-weighted factorization, 
which can factor noisy feature correspondences with high degree of direc- 
tional uncertainty into structure and motion. Our approach is based on 
transforming the raw-data into a covariance-weighted data space, where 
the components of noise in the different directions are uncorrelated and 
identically distributed. Applying SVD to the transformed data now mi- 
nimizes a meaningful objective function. We empirically show that our 
new algorithm gives good results for varying degrees of directional uncer- 
tainty. In particular, we show that unlike other SVD-based factorization 
algorithms, our method does not degrade with increase in directionality 
of uncertainty, even in the extreme when only normal-flow data is available. 
It thus provides a unified approach for treating corner-like points 
together with points along linear structures in the image. 



1 Introduction 



Factorization is often used for recovering 3D shape and motion from feature 
correspondences across multiple frames |8l4l5Kil7) . Singular Value Decomposi- 
tion (SVD) directly obtains the global minimum of the squared-error between 
the noisy data and the model. This is in contrast to iterative non-linear optimization 
methods which may converge to a local minimum. However, SVD 
requires that the noise in the x and y positions of features be uncorrelated and 
identically distributed. But it is rare that the positional errors of feature 
tracking algorithms are uncorrelated in their x and y coordinates. Quality of 
feature matching depends on the spatial variation of the intensity pattern around 
each feature. This affects the positional inaccuracy both in the x and in 
the y components in a correlated fashion. This dependency can be modeled by 
directional uncertainty (which varies from point to point, as is shown in Fig. 1). 

Fig. 1. Directional uncertainty indicated by ellipse. (a) Uncertainty of a sharp corner 
point. The uncertainty in all directions is small, since the underlying intensity structure 
shows variation in multiple directions. (b) Uncertainty of a point on a flat curve, almost 
a straight line. Note that the uncertainty in the direction of the line is large, while the 
uncertainty in the direction perpendicular to the line is small. This is because it is hard 
to localize the point along the line. 





When the uncertainty in a feature position is isotropic, but different features 
have different variances, then scalar-weighted SVD can be used to minimize a 
weighted squared error measure [p. However, under directional uncertainty noise 
assumptions (which is the case in reality) , the error minimized by SVD is no lon- 
ger meaningful. The proper measure to minimize is the covariance-weighted error 
(the Mahalanobis distance). This issue was either ignored by researchers 
El, or else was addressed using other minimization approaches |5E]. Morris and 
Kanade jSj have suggested a unified approach for recovering the 3D structure 
and motion from point and line features, by taking into account their directio- 
nal uncertainty. However, they solve their objective function using an iterative 
non-linear minimization scheme. The line factorization algorithm of Quan and 
Kanade |S| is SVD-based. However, it requires a preliminary step of 2D pro- 
jective reconstruction, which is necessary for rescaling the line directions in the 
image before further factorization can be applied. This step is then followed by 
three sequential SVD minimization steps, each applied to different intermediate 
results. This algorithm requires at least seven different directions of lines. 

In this paper we present a new approach to factorization, which introduces 
directional uncertainty into the SVD minimization framework. The input is the 
noisy positions of image features and their inverse covariance matrices which represent 
the uncertainty in the data. Following the approach of Irani [2], we write 
the image position vectors as row vectors, rather than as column vectors as is 
typically done in factorization methods. This allows us to use the inverse covariance 
matrices to transform the input position vectors into a new data space 
(the “covariance-weighted space”), where the noise is uncorrelated and identically 
distributed. In the new covariance-weighted data space, corner points and 
points on lines all have the same reliability, and their new positional components 
are uncorrelated. (This is in contrast with the original data space, where corner 
points and points on lines had different reliability, and their x and y components 
were correlated.) 
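
To make the idea of the covariance-weighted space concrete, the sketch below whitens individual points written as row vectors. It assumes each point p comes with a symmetric positive definite 2 × 2 inverse covariance $Q_p$, and uses one possible square root $C_p$ with $C_p C_p^T = Q_p$. The function name and the eigendecomposition-based square root are illustrative assumptions and may differ from the exact transformation the paper defines in its later sections.

```python
import numpy as np

def whiten(points, inv_covs):
    """points: (P, 2) row vectors (u, v); inv_covs: (P, 2, 2) inverse covariances Q_p.
    Returns rows [u v] @ C_p, where C_p @ C_p.T == Q_p."""
    out = np.empty(points.shape, dtype=float)
    for p, (x, Q) in enumerate(zip(points, inv_covs)):
        vals, vecs = np.linalg.eigh(Q)
        C = vecs @ np.diag(np.sqrt(vals))   # one square root of Q_p
        out[p] = x @ C
    return out
```

With this choice, the ordinary squared length of a transformed error vector equals the Mahalanobis distance $x\, Q_p\, x^T$ of the original one, which is why a plain least-squares fit in the new space is meaningful.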

We apply SVD factorization to the covariance-weighted data to obtain a 
global optimum. This minimizes the Mahalanobis distance in the original data 
space. However, the covariance-weighted data space has double the rank of the 
original data space. To obtain the required additional rank-halving, we use a 
least-squares minimization step within the double-rank subspace. 

Our approach allows the recovery of 3D motion for all frames and the 3D 
shape for all points, even when the uncertainty of point position is highly elliptic 
(for example, point on a line). It can handle reliable corner-like point correspondences 
and partial correspondences of points on lines (e.g., normal flow), all 
within a single SVD-like framework. In fact, we can handle extreme cases when 
the only image data available is normal flow. 

Irani [2] used confidence-weighted subspace projection directly on spatio-temporal 
brightness derivatives, in order to constrain multi-frame correspondence 
estimation. The confidences she used encoded directional uncertainty associated 
with each pixel. That formulation can be seen as a special case of the 
covariance-weighted factorization presented in this paper. 

Our approach thus extends the use of the powerful SVD factorization tech- 
nique with a proper treatment of directional uncertainty in the data. Different 
input features can have different directional uncertainties with different ellipti- 
cities (i.e., different covariance matrices). However, our extension does not allow 
arbitrary changes in the uncertainty of a single feature over multiple frames. We 
are currently able to handle the case where the change in the covariance matrices 
of all of the image features can be modeled by a global 2D affine transformation, 
which varies from frame to frame. 

The rest of the paper is organized as follows: Section 2 contains a short review 
of SVD factorization and formulates the problem for the case of directional uncer- 
tainty. Section 3 describes the transition from the raw data space, where noise is 
correlated and non-uniform, to the covariance-weighted data space, where noise 
is uniform and uncorrelated, giving rise to meaningful SVD subspace projection. 
Section 4 explains how the covariance-weighted data can be factored into 3D motion 
and 3D shape. Section 5 extends the solution presented in Sections 3 and 4, 
to a more general case when the directional uncertainty of a point changes across 
views. Section 6 provides experimental results and empirical comparison of our 
factorization method to other common SVD factorization methods. Section 7 
concludes the paper. 



2 Problem Formulation 



2.1 SVD Factorization 



A set of P points are tracked across F images with coordinates
$\{(u'_{fp}, v'_{fp}) \mid f = 1, \ldots, F,\ p = 1, \ldots, P\}$. The point
coordinates are transformed to object-centered coordinates by subtracting their
center of mass: $(u'_{fp}, v'_{fp})$ is replaced by
$(u_{fp}, v_{fp}) = (u'_{fp} - \bar{u}_f,\ v'_{fp} - \bar{v}_f)$ for all $f$ and
$p$, where $\bar{u}_f$ and $\bar{v}_f$ are the centroids of point positions in
each frame: $\bar{u}_f = \frac{1}{P}\sum_p u'_{fp}$,
$\bar{v}_f = \frac{1}{P}\sum_p v'_{fp}$.

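Two F × P measurement matrices U and V are constructed by stacking all
the measured correspondences as follows:

$$U = \begin{bmatrix} u_{11} & \cdots & u_{1P} \\ \vdots & & \vdots \\ u_{F1} & \cdots & u_{FP} \end{bmatrix}, \qquad V = \begin{bmatrix} v_{11} & \cdots & v_{1P} \\ \vdots & & \vdots \\ v_{F1} & \cdots & v_{FP} \end{bmatrix}$$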



It was shown that when the camera is an affine camera (i.e., orthographic,
weak-perspective, or paraperspective), and when there is no noise, then the rank
of $W = \begin{bmatrix} U \\ V \end{bmatrix}_{2F \times P}$ is 3 or less, and
can be factored into a product of a motion matrix M and a shape matrix S, i.e.,
W = MS, where:

$$M = \begin{bmatrix} M_U \\ M_V \end{bmatrix}_{2F \times 3}, \quad M_U = \begin{bmatrix} m_1^T \\ \vdots \\ m_F^T \end{bmatrix}_{F \times 3}, \quad M_V = \begin{bmatrix} n_1^T \\ \vdots \\ n_F^T \end{bmatrix}_{F \times 3}, \quad S = [s_1 \cdots s_P]_{3 \times P}$$



The rows of M encode the motion for each frame (rotation in the case of or- 
thography), and the columns of S contain the 3D position of each point in the 
reconstructed scene. 

When there are errors in the measurement matrix W, then each position
$(u_{fp}\ v_{fp})^T$ has a 2D noise vector associated with it:

$$\varepsilon_{fp} = \begin{bmatrix} u_{fp} - m_f^T s_p \\ v_{fp} - n_f^T s_p \end{bmatrix}$$

When the noise $\varepsilon_{fp}$ is an isotropic Gaussian random variable with a
fixed variance $\sigma^2$, i.e., $\forall f\ \forall p\ \ \varepsilon_{fp} \sim
N(0, \sigma^2 I_{2 \times 2})$, then the maximum likelihood estimate is obtained
by minimizing the squared error:

$$\mathrm{Err}_{SVD}(M, S) = \sum_{f,p} \varepsilon_{fp}^T \varepsilon_{fp} = \| W - MS \|_F^2$$

where $\| \cdot \|_F$ denotes the Frobenius norm. The global minimum to this
non-linear problem is obtained by performing Singular Value Decomposition (SVD)
on the measurement matrix: $W = A \Sigma B^T$, and setting to zero all but the
three largest singular values in $\Sigma$, to get a noise-cleaned matrix
$\hat{W} = A \hat{\Sigma} B^T$. The recovered motion and shape matrices
$\hat{M}$ and $\hat{S}$ are then obtained by: $\hat{M} = A \hat{\Sigma}^{1/2}$,
and $\hat{S} = \hat{\Sigma}^{1/2} B^T$. Note that $\hat{M}$ and $\hat{S}$ are
defined only up to an affine transformation.
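The truncated-SVD factorization described above can be sketched in a few lines
of NumPy. This is only an illustrative sketch under stated assumptions: the
function name `svd_factorize`, the synthetic sizes, and the noise level are
hypothetical choices, not part of the paper.

```python
import numpy as np

def svd_factorize(W, rank=3):
    """Rank-`rank` factorization W ~ M @ S via truncated SVD (hypothetical helper)."""
    A, sigma, Bt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(sigma[:rank])        # square roots of the kept singular values
    M = A[:, :rank] * root              # 2F x rank motion matrix, A Sigma^{1/2}
    S = root[:, None] * Bt[:rank]       # rank x P shape matrix, Sigma^{1/2} B^T
    return M, S

# Synthetic example: F frames, P points, isotropic noise (all values assumed).
rng = np.random.default_rng(0)
F, P = 8, 20
M_true = rng.normal(size=(2 * F, 3))
S_true = rng.normal(size=(3, P))
W_clean = M_true @ S_true
W_noisy = W_clean + 0.01 * rng.normal(size=W_clean.shape)

M_hat, S_hat = svd_factorize(W_noisy)
W_hat = M_hat @ S_hat                   # noise-cleaned rank-3 measurement matrix
```

The product `M_hat @ S_hat` is the closest rank-3 matrix to `W_noisy` in the
Frobenius norm; `M_hat` and `S_hat` themselves agree with the ground truth only
up to an invertible 3×3 transformation.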



2.2 Scalar Uncertainty 



The model in Section 2.1 (as well as in [?]) weights equally the contribution
of each point feature to the final shape and motion matrices. However, when
the noise $\varepsilon_{fp}$ is isotropic, but with different variances for the
different points $\{\sigma_p^2 \mid p = 1 \cdots P\}$, then
$\varepsilon_{fp} \sim N(0, \sigma_p^2 I_{2 \times 2})$. In such cases, applying
SVD to the weighted matrix $W_\sigma = W \sigma^{-1}$, where
$\sigma^{-1} = \mathrm{diag}(\sigma_1^{-1}, \ldots, \sigma_P^{-1})$, will
minimize the correct error function:

$$\mathrm{Err}_{weighted\text{-}SVD}(M, S) = \sum_{f,p} \frac{\varepsilon_{fp}^T \varepsilon_{fp}}{\sigma_p^2} = \| (W - MS)\, \sigma^{-1} \|_F^2 = \| W_\sigma - M S_\sigma \|_F^2$$

where $S_\sigma = S \sigma^{-1}$. Applying SVD-factorization to $W_\sigma$ will
give $\hat{M}$ and $\hat{S}_\sigma$, from which $\hat{S} = \hat{S}_\sigma \sigma$
can be recovered. This approach is known as weighted-SVD or
weighted-factorization [?].
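As a minimal sketch of the scalar-weighted variant, the column weighting can be
applied before the SVD and undone afterwards on the shape matrix. The helper
name `weighted_svd_factorize` and the synthetic data are assumptions made for
illustration.

```python
import numpy as np

def weighted_svd_factorize(W, sigma, rank=3):
    """Scalar-weighted factorization; sigma[p] is the noise std of point p."""
    W_s = W / sigma                     # W sigma^{-1}: column p scaled by 1/sigma_p
    A, sv, Bt = np.linalg.svd(W_s, full_matrices=False)
    root = np.sqrt(sv[:rank])
    M = A[:, :rank] * root              # motion matrix
    S_s = root[:, None] * Bt[:rank]     # weighted shape S_sigma = S sigma^{-1}
    S = S_s * sigma                     # undo the weighting: S = S_sigma sigma
    return M, S

# Noise-free check on synthetic data (sizes and weights are assumed values).
rng = np.random.default_rng(0)
F, P = 6, 15
W = rng.normal(size=(2 * F, 3)) @ rng.normal(size=(3, P))
sigma = rng.uniform(0.5, 2.0, size=P)
M_hat, S_hat = weighted_svd_factorize(W, sigma)
```

With noise-free input the weighting changes nothing in the product
`M_hat @ S_hat`; with noisy input, points with small `sigma[p]` pull the
solution more strongly.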



2.3 Directional Uncertainty 

So far we have assumed that the noise in $u_{fp}$ is uncorrelated with the noise
in $v_{fp}$. In real image sequences, however, this is not the case. Tracking
algorithms introduce non-uniform correlated error in the tracked positions of
points, which depends on the local image structure. For example, a corner point
p will be tracked with high reliability both in $u_{fp}$ and in $v_{fp}$, while
a point p on a line will be tracked with high reliability in the direction of
the gradient (“normal flow”), but with low reliability in the tangent direction
(see Fig. 1). This leads to non-uniform correlated noise in $u_{fp}$ and
$v_{fp}$. We model the correlated noise $\varepsilon_{fp}$ by:
$\varepsilon_{fp} \sim N(0, Q_{fp}^{-1})$, where $Q_{fp}$ is the 2×2 inverse
covariance matrix of the noise at point p in image-frame f. The covariance
matrix determines an ellipse whose major and minor axes indicate the directional
uncertainty in the location $(u_{fp}\ v_{fp})^T$ of a point p in frame f (see
Fig. 1, as well as [?] for some examples).

Assuming that the noise at different points is independent, then the maximum
likelihood solution is obtained by finding matrices M and S which minimize
the following objective function:

$$\mathrm{Err}(M, S) = \sum_{f,p} \left( \varepsilon_{fp}^T Q_{fp}\, \varepsilon_{fp} \right) = \sum_{f,p} \left( \left[ (u_{fp} - m_f^T s_p) \ \ (v_{fp} - n_f^T s_p) \right] Q_{fp} \begin{bmatrix} u_{fp} - m_f^T s_p \\ v_{fp} - n_f^T s_p \end{bmatrix} \right) \qquad (1)$$

Eq. (1) implies that in the case of directional uncertainty, the metric that we
want to use in the minimization is the Mahalanobis distance, and not the
Frobenius (least-squares) norm, which is the distance minimized by the SVD
process.
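To make the objective of Eq. (1) concrete, the following sketch evaluates it for
given motion, shape, and per-measurement inverse covariances. The function name
and the array layout (`Q` of shape `(F, P, 2, 2)`) are assumptions chosen for
illustration.

```python
import numpy as np

def mahalanobis_error(U, V, M_U, M_V, S, Q):
    """Eq. (1): sum over f, p of eps_fp^T Q_fp eps_fp, with Q of shape (F, P, 2, 2)."""
    eps = np.stack([U - M_U @ S, V - M_V @ S], axis=-1)    # residuals, (F, P, 2)
    return float(np.einsum('fpi,fpij,fpj->', eps, Q, eps))

# With Q_fp = I the objective reduces to the squared Frobenius error of Section 2.1.
rng = np.random.default_rng(0)
F, P = 5, 7
M_U, M_V = rng.normal(size=(F, 3)), rng.normal(size=(F, 3))
S = rng.normal(size=(3, P))
U = M_U @ S + 0.1 * rng.normal(size=(F, P))
V = M_V @ S + 0.1 * rng.normal(size=(F, P))
Q_iso = np.broadcast_to(np.eye(2), (F, P, 2, 2))
err_iso = mahalanobis_error(U, V, M_U, M_V, S, Q_iso)
frob = np.linalg.norm(U - M_U @ S) ** 2 + np.linalg.norm(V - M_V @ S) ** 2
```

For isotropic `Q` the two quantities `err_iso` and `frob` coincide, which is why
plain SVD is adequate only in that special case.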

Morris and Kanade [3] have addressed this problem and suggested an approach to
recovering M and S which is based on minimizing the Mahalanobis distance.
However, their approach uses an iterative non-linear minimization scheme. In
the next few sections we present our approach to SVD-based factorization, which
minimizes the Mahalanobis error. Our approach combines the benefits of
SVD-based factorization for getting a good solution, with the proper treatment
of directional uncertainty¹. However, unlike [3], our approach cannot handle
arbitrary changes in covariance matrices of a single feature over multiple
frames. It can only handle frame-dependent 2D affine deformations of the
covariance matrices across different views (see Section 5).



3 From Raw-Data Space to Covariance-Weighted Space 



In this section we show how, by transforming the noisy data (i.e.,
correspondences) from the raw-data space to a new covariance-weighted space, we
can minimize the Mahalanobis distance defined in Eq. (1), while still retaining
the benefits of SVD minimization. In particular, we will show that minimizing
the Frobenius norm in the new data space (e.g., via SVD) is equivalent to
minimizing the Mahalanobis distance in the raw-data space. This transition is
made possible by rearranging the raw feature positions in a slightly modified
matrix form: $[U \mid V]_{F \times 2P}$, namely the matrices U and V stacked
horizontally (as opposed to vertically in
$W = \begin{bmatrix} U \\ V \end{bmatrix}$, which is the standard matrix form
used in the traditional factorization methods (see Section 2.1)). This modified
matrix representation is necessary to introduce covariance-weights into the SVD
process, and was originally proposed by Irani [2], who used it for applying
confidence-weighted subspace projection to spatio-temporal brightness
derivatives for computing optical flow across multiple frames.



¹ When directional uncertainty is used, the centroids $\{\bar{u}_f\}$ and
$\{\bar{v}_f\}$ defined in Section 2.1 are the covariance-weighted means over
all points of $\{u_{fp}\}$ and $\{v_{fp}\}$ in frame f.

For simplicity, we start by investigating the simpler case when the directional
uncertainty of a point does not change over time (i.e., frames), namely, when
the 2×2 inverse covariance matrix $Q_{fp}$ of a point p is frame-independent:
$\forall f\ \ Q_{fp} \equiv Q_p$. Later, in Section 5, we will extend the
approach to handle the case when the covariance matrices undergo frame-dependent
2D-affine changes. Because $Q_p$ is positive semi-definite, its eigenvalue
decomposition has the form $Q_p = \Omega \Lambda \Omega^T$, where
$\Omega_{2 \times 2}$ is a real orthonormal matrix, and
$\Lambda_{2 \times 2} = \mathrm{diag}(\lambda_{max}, \lambda_{min})$. Let
$C_p = \Omega \Lambda^{1/2}$ and
$[\alpha_{fp}\ \beta_{fp}]_{1 \times 2} = [u_{fp}\ v_{fp}]_{1 \times 2}\, C_p$.
Therefore, $\alpha_{fp}$ is the component of $[u_{fp}\ v_{fp}]$ in the direction
of the highest certainty (scaled by its certainty), and $\beta_{fp}$ is the
component in the direction of the lowest certainty (scaled by its certainty).
For example, in the case of a point p which lies on a line, $\alpha_{fp}$ would
correspond to the component in the direction perpendicular to the line (i.e.,
the direction of the normal flow), and $\beta_{fp}$ would correspond to the
component in the direction tangent to the line (the direction of infinite
uncertainty). In the case of a perfect line (i.e., zero certainty in the
direction of the line), $\beta_{fp} = 0$. When the position of a point can be
determined with finite certainty in both directions (e.g., for corner points),
then $C_p$ is a regular (invertible) matrix. Otherwise, when there is infinite
uncertainty in at least one direction (e.g., as in lines or uniform image
regions), then $C_p$ is singular.
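The construction of $C_p = \Omega \Lambda^{1/2}$ from an inverse covariance
matrix can be sketched as follows. The helper name `sqrt_factor` and the two
example matrices are hypothetical; the second example imitates a point on a
perfect line, for which certainty exists only along the line normal.

```python
import numpy as np

def sqrt_factor(Q):
    """Return C = Omega Lambda^{1/2} with C C^T = Q and lambda_max in the first column."""
    lam, omega = np.linalg.eigh(Q)              # eigenvalues in ascending order
    lam, omega = lam[::-1], omega[:, ::-1]      # reorder so lambda_max comes first
    lam = np.clip(lam, 0.0, None)               # guard against tiny negative round-off
    return omega * np.sqrt(lam)                 # scales column j by sqrt(lam[j])

# Corner-like point: finite certainty in both directions, C is invertible.
Q_corner = np.array([[4.0, 1.0], [1.0, 3.0]])
C_corner = sqrt_factor(Q_corner)

# Line-like point: certainty only along the normal n, C is singular.
n = np.array([np.cos(0.3), np.sin(0.3)])
Q_line = 25.0 * np.outer(n, n)
C_line = sqrt_factor(Q_line)

# Covariance-weighted components [alpha, beta] of a position [u, v].
alpha_beta = np.array([2.0, -1.0]) @ C_line
```

For the line-like point the second component `alpha_beta[1]` vanishes up to
round-off, matching the remark that $\beta_{fp} = 0$ for a perfect line.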

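Let $\alpha_p$, $\beta_p$, $u_p$ and $v_p$ be four F × 1 vectors corresponding
to a point p across all frames:

$$\alpha_p = \begin{bmatrix} \alpha_{1p} \\ \vdots \\ \alpha_{Fp} \end{bmatrix}, \quad \beta_p = \begin{bmatrix} \beta_{1p} \\ \vdots \\ \beta_{Fp} \end{bmatrix}, \quad u_p = \begin{bmatrix} u_{1p} \\ \vdots \\ u_{Fp} \end{bmatrix}, \quad v_p = \begin{bmatrix} v_{1p} \\ \vdots \\ v_{Fp} \end{bmatrix}$$

then

$$[\alpha_p \ \ \beta_p] = [u_p \ \ v_p]\, C_p. \qquad (2)$$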

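Let $\alpha$ and $\beta$ be two F × P matrices:

$$\alpha = \begin{bmatrix} \alpha_{11} & \cdots & \alpha_{1P} \\ \vdots & & \vdots \\ \alpha_{F1} & \cdots & \alpha_{FP} \end{bmatrix}_{F \times P} \quad \text{and} \quad \beta = \begin{bmatrix} \beta_{11} & \cdots & \beta_{1P} \\ \vdots & & \vdots \\ \beta_{F1} & \cdots & \beta_{FP} \end{bmatrix}_{F \times P}$$

then, according to Eq. (2):

$$[\alpha \mid \beta]_{F \times 2P} = [U \mid V]_{F \times 2P}\; C_{2P \times 2P} \qquad (3)$$

where C is a 2P × 2P matrix, constructed from all 2 × 2 matrices
$C_p = \begin{bmatrix} c^p_{11} & c^p_{12} \\ c^p_{21} & c^p_{22} \end{bmatrix}$
(p = 1 ... P), as follows (four P × P diagonal blocks, as implied by Eq. (2)):

$$C = \begin{bmatrix} \mathrm{diag}(c^1_{11}, \ldots, c^P_{11}) & \mathrm{diag}(c^1_{12}, \ldots, c^P_{12}) \\ \mathrm{diag}(c^1_{21}, \ldots, c^P_{21}) & \mathrm{diag}(c^1_{22}, \ldots, c^P_{22}) \end{bmatrix}_{2P \times 2P}$$

Note that matrix $\alpha$ contains the components of all point positions in
their directions of highest certainty, and $\beta$ contains the components of
all point positions in their directions of lowest certainty. These directions
vary from point to point and are independent. Furthermore, $\alpha_{fp}$ and
$\beta_{fp}$ are also independent, and the noise in those two components is now
uncorrelated. This will be shown and used below.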



Let R denote the rank of $W = \begin{bmatrix} U \\ V \end{bmatrix}_{2F \times P}$
(when W is noiseless, and the camera is an affine camera, then $R \le 3$; see
Section 2.1). A review of different ranks R for different camera and world
models can be found in [2]. Then the rank of U and the rank of V is each at most
R. Hence, the rank of $[U \mid V]_{F \times 2P}$ is at most 2R (for an affine
camera, in the absence of noise, $2R \le 6$). Therefore, according to Eq. (3),
the rank of $[\alpha \mid \beta]$ is also at most 2R.

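The problem of minimizing the Mahalanobis distance of Eq. (1) can be restated as
follows: Given noisy positions
$\{(u_{fp}\ v_{fp})^T \mid f = 1 \cdots F,\ p = 1 \cdots P\}$, find new positions
$\{(\hat{u}_{fp}\ \hat{v}_{fp})^T \mid f = 1 \cdots F,\ p = 1 \cdots P\}$ that
minimize the following error function:

$$\mathrm{Err}\left(\{(\hat{u}_{fp}\ \hat{v}_{fp})^T\}\right) = \sum_{f,p} \left[ (u_{fp} - \hat{u}_{fp}) \ \ (v_{fp} - \hat{v}_{fp}) \right] Q_{fp} \begin{bmatrix} u_{fp} - \hat{u}_{fp} \\ v_{fp} - \hat{v}_{fp} \end{bmatrix}. \qquad (4)$$

Because $Q_{fp} = Q_p = C_p C_p^T$, we can rewrite this error term as:

$$\begin{aligned}
&= \sum_{f,p} \left( \left[ (u_{fp} - \hat{u}_{fp}) \ \ (v_{fp} - \hat{v}_{fp}) \right] C_p \right) \left( \left[ (u_{fp} - \hat{u}_{fp}) \ \ (v_{fp} - \hat{v}_{fp}) \right] C_p \right)^T \\
&= \| [U - \hat{U} \mid V - \hat{V}]\, C \|_F^2 \\
&= \| [U \mid V]\, C - [\hat{U} \mid \hat{V}]\, C \|_F^2 \\
&= \| [\alpha \mid \beta] - [\hat{\alpha} \mid \hat{\beta}] \|_F^2
\end{aligned}$$

where $[\hat{U} \mid \hat{V}]$ is the F × 2P matrix containing all the
$\{\hat{u}_{fp}, \hat{v}_{fp}\}$, and
$[\hat{\alpha} \mid \hat{\beta}] = [\hat{U} \mid \hat{V}]\, C$. Therefore:

Minimizing the Mahalanobis distance of Eq. (1) is equivalent to finding the
rank-2R matrix $[\hat{\alpha} \mid \hat{\beta}]$ closest to
$[\alpha \mid \beta]$ in the Frobenius norm.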
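The equivalence between the Mahalanobis error in the raw-data space and the
Frobenius error in the covariance-weighted space can be checked numerically. The
sketch below assumes frame-independent factors $C_p$ and builds the 2P × 2P
matrix C from four diagonal blocks, which is the arrangement implied by Eq. (2);
all data are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
F, P = 6, 5

# Per-point 2x2 factors C_p (frame-independent), with Q_p = C_p C_p^T.
Cp = rng.normal(size=(P, 2, 2))
Q = Cp @ Cp.transpose(0, 2, 1)

# 2P x 2P matrix C such that [alpha | beta] = [U | V] C.
C = np.block([[np.diag(Cp[:, 0, 0]), np.diag(Cp[:, 0, 1])],
              [np.diag(Cp[:, 1, 0]), np.diag(Cp[:, 1, 1])]])

U, V = rng.normal(size=(F, P)), rng.normal(size=(F, P))            # noisy positions
U_hat, V_hat = rng.normal(size=(F, P)), rng.normal(size=(F, P))    # any candidate

# Mahalanobis error of Eq. (4), accumulated point by point.
d = np.stack([U - U_hat, V - V_hat], axis=-1)                      # (F, P, 2)
mahalanobis = np.einsum('fpi,pij,fpj->', d, Q, d)

# Squared Frobenius error in the covariance-weighted space.
ab = np.hstack([U, V]) @ C
ab_hat = np.hstack([U_hat, V_hat]) @ C
frobenius = np.linalg.norm(ab - ab_hat, 'fro') ** 2
```

The two scalars agree for any candidate positions, so a Frobenius-optimal
low-rank approximation of `ab` is also Mahalanobis-optimal in the original
coordinates.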



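This minimization can be done by applying SVD subspace projection to the
matrix $[\alpha \mid \beta]$, to obtain the optimal
$[\hat{\alpha} \mid \hat{\beta}]$. This is done by applying SVD to the known
$[\alpha \mid \beta]$ matrix, and setting to zero all but the highest 2R
singular values. However, note that although optimal,
$[\hat{\alpha} \mid \hat{\beta}] = [\hat{U} \mid \hat{V}]\, C$ is in general a
rank-2R matrix, and does not guarantee that
$\hat{W} = \begin{bmatrix} \hat{U} \\ \hat{V} \end{bmatrix}$ is a rank-R matrix.
In Section 4 we show how we complete the process by making the transition from
the optimal rank-2R matrix $[\hat{\alpha} \mid \hat{\beta}]$ to the rank-R
solution
$\hat{W} = \begin{bmatrix} \hat{U} \\ \hat{V} \end{bmatrix} = \begin{bmatrix} \hat{M}_U \\ \hat{M}_V \end{bmatrix} \hat{S}$.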






M. Irani and P. Anandan 



4 Factoring Shape and Motion 



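The process of finding the rank-2R $[\hat{\alpha} \mid \hat{\beta}]$, as
outlined in Section 3, does not yet guarantee that the corresponding $\hat{U}$
and $\hat{V}$ can be decomposed into rank-R matrices as follows:

$$\begin{bmatrix} \hat{U} \\ \hat{V} \end{bmatrix} = \hat{M} \hat{S} = \begin{bmatrix} \hat{M}_U \\ \hat{M}_V \end{bmatrix} \hat{S}.$$

In this section we complete the process and recover $\hat{M}$ and $\hat{S}$ by
enforcing this matrix constraint on $\hat{U}$ and $\hat{V}$.

Note that if C were an invertible matrix, then we could have recovered
$[\hat{U} \mid \hat{V}] = [\hat{\alpha} \mid \hat{\beta}]\, C^{-1}$ and then
proceeded with applying standard SVD to
$\begin{bmatrix} \hat{U} \\ \hat{V} \end{bmatrix}$ to impose the rank-R
constraint and recover $\hat{M}$ and $\hat{S}$. However, C is in general not
invertible (e.g., because of points with high aperture problem). Imposing the
rank-R constraint on $\hat{U} = \hat{M}_U \hat{S}$ and
$\hat{V} = \hat{M}_V \hat{S}$ must therefore be done in the
$[\hat{\alpha} \mid \hat{\beta}]$ space (i.e., without inverting C):

$$[\hat{\alpha} \mid \hat{\beta}]_{F \times 2P} = [\hat{M}_U \hat{S} \mid \hat{M}_V \hat{S}]\, C = [\hat{M}_U \mid \hat{M}_V]_{F \times 2R} \begin{bmatrix} \hat{S} & 0 \\ 0 & \hat{S} \end{bmatrix}_{2R \times 2P} C. \qquad (5)$$

Not every decomposition of $[\hat{\alpha} \mid \hat{\beta}]$ has the matrix form
$\begin{bmatrix} \hat{S} & 0 \\ 0 & \hat{S} \end{bmatrix} C$. However, if we are
able to decompose $[\hat{\alpha} \mid \hat{\beta}]$ into the matrix form of
Eq. (5), then the resulting
$\hat{M} = \begin{bmatrix} \hat{M}_U \\ \hat{M}_V \end{bmatrix}$ and $\hat{S}$
(which can be determined only up to an affine transformation) will provide the
desired rank-R solution.

Because $[\hat{\alpha} \mid \hat{\beta}]_{F \times 2P}$ is a rank-2R matrix, it
can be written as a bilinear product of an F × 2R matrix H and a 2R × 2P
matrix G:

$$[\hat{\alpha} \mid \hat{\beta}]_{F \times 2P} = H_{F \times 2R}\; G_{2R \times 2P}.$$

This decomposition is not unique. For any invertible 2R × 2R matrix D,
$[\hat{\alpha} \mid \hat{\beta}] = (H D^{-1})(D G)$ is also a valid
decomposition. We seek a matrix D which will bring DG into a form

$$DG = \begin{bmatrix} S & 0 \\ 0 & S \end{bmatrix} C \qquad (6)$$

where S is an arbitrary R × P matrix. This is a linear system of equations in
the unknown components of S and D. We therefore linearly solve for S and D, from
which the desired solution is obtained by: $\hat{S} := S$ and
$[\hat{M}_U \mid \hat{M}_V] := H D^{-1}$.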
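One possible way to set up the linear solve for D and S is sketched below on
noise-free synthetic data. Several choices here are assumptions rather than the
paper's implementation: Eq. (6) is read as $DG = \begin{bmatrix} S & 0 \\ 0 & S
\end{bmatrix} C$ (consistent with Eq. (5)), the homogeneous system is solved
through its null space, a random combination of null vectors is used to pick one
member of the affine family of solutions, and the factors `Cp` are drawn at
random (hence invertible) so that the result can be verified against the ground
truth. With noisy data the right singular vectors of the smallest singular
values would give a least-squares solution instead.

```python
import numpy as np

rng = np.random.default_rng(0)
F, P, R = 10, 12, 3

# Synthetic noise-free ground truth: U = M_U S, V = M_V S.
M_U, M_V = rng.normal(size=(F, R)), rng.normal(size=(F, R))
S_true = rng.normal(size=(R, P))
U, V = M_U @ S_true, M_V @ S_true

# Hypothetical per-point factors C_p, assembled so that [alpha | beta] = [U | V] C.
Cp = rng.normal(size=(P, 2, 2))
C = np.block([[np.diag(Cp[:, 0, 0]), np.diag(Cp[:, 0, 1])],
              [np.diag(Cp[:, 1, 0]), np.diag(Cp[:, 1, 1])]])
AB = np.hstack([U, V]) @ C

# Rank-2R bilinear decomposition [alpha | beta] = H G taken from the SVD.
A_, sv, Bt = np.linalg.svd(AB, full_matrices=False)
H = A_[:, :2 * R] * sv[:2 * R]          # F x 2R
G = Bt[:2 * R]                          # 2R x 2P

def residual(x):
    """Linear map x = (D, S) -> D G - blockdiag(S, S) C, flattened."""
    D = x[:4 * R * R].reshape(2 * R, 2 * R)
    S = x[4 * R * R:].reshape(R, P)
    Z = np.zeros_like(S)
    return (D @ G - np.block([[S, Z], [Z, S]]) @ C).ravel()

# Matrix of the homogeneous system, built column by column from unit vectors.
n = 4 * R * R + R * P
L = np.stack([residual(e) for e in np.eye(n)], axis=1)

# The null space has dimension R*R (the affine ambiguity of S); a random
# combination of null vectors gives a generic solution with invertible D.
_, sing, Vt = np.linalg.svd(L)
x = rng.normal(size=R * R) @ Vt[-R * R:]
D = x[:4 * R * R].reshape(2 * R, 2 * R)
S_hat = x[4 * R * R:].reshape(R, P)

M_hat = H @ np.linalg.inv(D)            # [M_U | M_V] = H D^{-1}
U_hat, V_hat = M_hat[:, :R] @ S_hat, M_hat[:, R:] @ S_hat
```

In this noise-free setting `U_hat` and `V_hat` reproduce `U` and `V`, while
`S_hat` matches `S_true` only up to an invertible R × R transformation.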



4.1 Summary of the Algorithm 

We summarize the steps of the algorithm:

Step 1: Project the covariance-weighted data
$[\alpha \mid \beta] = [U \mid V]\, C$ onto a 2R-dimensional subspace (i.e., a
rank-2R matrix) $[\hat{\alpha} \mid \hat{\beta}]$ (for an affine camera
$2R \le 6$). This step is guaranteed to obtain the closest 2R-dimensional
subspace because of the global optimum property of SVD.

Step 2: Further enforce the rank-R solution by enforcing that
$\begin{bmatrix} \hat{U} \\ \hat{V} \end{bmatrix} = \begin{bmatrix} \hat{M}_U \\ \hat{M}_V \end{bmatrix} \hat{S}$.
This additional subspace projection is achieved within the
$[\hat{\alpha} \mid \hat{\beta}]$ space, and is obtained with simple least
squares minimization applied to the linear set of equations (6).

Note that the rank-R subspace obtained by the second step is contained inside
the rank-2R subspace obtained in the first step. We cannot prove that the
optimal rank-R solution is guaranteed to lie within this rank-2R subspace.
However, the bulk of the optimization task is performed in Step 1, which takes
the noisy high-dimensional data into the rank-2R subspace in an optimal fashion.
Moreover, both steps of our algorithm are linear. Our empirical results
presented in Section 6 indicate that our two-step algorithm accurately recovers
the motion and shape, while taking into account varying degrees of directional
uncertainty.



5 Frame-Dependent Directional Uncertainty 

So far we have assumed that all frames share the same 2×2 inverse covariance
matrix $Q_p$ for a point p, i.e., $\forall f\ \ Q_{fp} \equiv Q_p$ and thus
$C_{fp} \equiv C_p$. This assumption, however, is very restrictive, as image
motion induces changes in these matrices. For example, a rotation in the image
plane induces a rotation on $C_{fp}$ (for all points p). Similarly, a scaling in
the image plane induces a scaling in $C_{fp}$, and so forth for skew in the
image plane. (Note, however, that a shift in the image plane does not change
$C_{fp}$.)

The assumption $\forall f\ \ C_{fp} \equiv C_p$ was needed in order to obtain
the separable matrix form of Eq. (3), thus deriving the result that the rank of
$[\alpha \mid \beta]$ is at most 2R. Such a separation cannot be achieved for
inverse covariance matrices $Q_{fp}$ which change arbitrarily and independently.
However, a similar result can be obtained for the case when all the inverse
covariance matrices of all points change over time in a “similar way”.

Let $\{Q_p \mid p = 1 \cdots P\}$ be “reference” inverse covariance matrices of
all the points (in Section 5.2 we explain how these are chosen). Let
$\{C_p \mid p = 1 \cdots P\}$ be defined such that $C_p C_p^T = Q_p$ ($C_p$ is
uniquely defined by the eigenvalue decomposition, same as defined in Section 3).
In this section we show that if there exist 2×2 “deformation” matrices
$\{A_f \mid f = 1, \ldots, F\}$ such that:

$$\forall p, \forall f: \quad C_{fp} = A_f C_p, \qquad (7)$$

then the approach presented in Sections 3 and 4 still applies.

Such 2×2 matrices $\{A_f\}$ can account for global 2D affine deformations in the
image plane (rotation, scale, and skew). Note that while $C_{fp}$ is different
in every frame f and at every point p, they are not arbitrary. For a given point
p, all its 2×2 matrices $C_{fp}$ across all views share the same 2×2 reference
matrix $C_p$ (which captures the common underlying local image structure and
degeneracies in the vicinity of p), while for a given frame (view) f, the
matrices $C_{fp}$ of all points within that view share the same 2×2 “affine”
deformation $A_f$ (which captures the common image distortion induced on the
local image structure by the common camera motion). Of course, there are many
scenarios in which Eq. (7) will not suffice to model the changes in the inverse
covariance matrices. However, the formulation in Eq. (7) does cover a wide range
of scenarios, and can be used as a first-order approximation to the actual
changes in the inverse covariance matrices in the more general case. In
Section 5.2 we discuss how we choose the matrices $\{C_p\}$ and $\{A_f\}$.

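We next show that under the assumptions of Eq. (7), the rank of
$[\alpha \mid \beta]$ is still at most 2R. Let
$[\alpha_{fp}\ \beta_{fp}]_{1 \times 2} = [u_{fp}\ v_{fp}]_{1 \times 2}\, (C_{fp})_{2 \times 2}$
(this is the same definition as in Section 3, only here we use $C_{fp}$ instead
of $C_p$). Then:

$$[\alpha_{fp}\ \beta_{fp}] = [u_{fp}\ v_{fp}]\, A_f C_p = [\tilde{u}_{fp}\ \tilde{v}_{fp}]\, C_p$$

where $[\tilde{u}_{fp}\ \tilde{v}_{fp}] = [u_{fp}\ v_{fp}]\, A_f$. Let
$\tilde{U}$ be the matrix of all $\tilde{u}_{fp}$ and $\tilde{V}$ be the matrix
of all $\tilde{v}_{fp}$. Because $C_p$ is shared by all views of the point p,
then (just like in Eq. (3)):

$$[\alpha \mid \beta] = [\tilde{U} \mid \tilde{V}]\, C$$

where C is the same 2P × 2P matrix defined in Section 3. Therefore the rank of
$[\alpha \mid \beta]$ is at most the rank of $[\tilde{U} \mid \tilde{V}]$. We
still need to show that the rank of $[\tilde{U} \mid \tilde{V}]$ is at most 2R
(at most 6). According to the definition of $\tilde{u}_{fp}$ and
$\tilde{v}_{fp}$:

$$\begin{bmatrix} \tilde{u}_{fp} \\ \tilde{v}_{fp} \end{bmatrix} = A_f^T \begin{bmatrix} u_{fp} \\ v_{fp} \end{bmatrix} = A_f^T \begin{bmatrix} m_f^T \\ n_f^T \end{bmatrix} s_p.$$

This implies that the rank of
$\begin{bmatrix} \tilde{U} \\ \tilde{V} \end{bmatrix}$ is at most R, and
therefore the rank of $[\tilde{U} \mid \tilde{V}]$ is at most 2R. Therefore, the
rank of $[\alpha \mid \beta]$ is at most 2R even in the case of
“affine-deformed” inverse covariance matrices.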



5.1 The Generalized Factorization Algorithm 

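The factorization algorithm summarized in Section 4.1 can be easily generalized
to handle the case of affine-deformed directional uncertainty. Given matrices
$\{A_f \mid f = 1 \cdots F\}$ and $\{C_p \mid p = 1 \cdots P\}$, such that
$C_{fp} = A_f C_p$, the algorithm is as follows:

Step 0: For each point p and each frame f compute:

$$\begin{bmatrix} \tilde{u}_{fp} \\ \tilde{v}_{fp} \end{bmatrix} = A_f^T \begin{bmatrix} u_{fp} \\ v_{fp} \end{bmatrix}.$$

Steps 1 and 2: Use the same algorithm (Steps 1 and 2) as in Section 4.1 (with
the matrices $\{C_p \mid p = 1 \cdots P\}$), but apply it to the matrix
$[\tilde{U} \mid \tilde{V}]$ instead of $[U \mid V]$. These two steps yield the
matrices $\hat{S}$, $\tilde{M}_U$, and $\tilde{M}_V$, where

$$\begin{bmatrix} \tilde{m}_f^T \\ \tilde{n}_f^T \end{bmatrix} = A_f^T \begin{bmatrix} m_f^T \\ n_f^T \end{bmatrix}.$$

Step 3: Recover $\hat{M}_U$ and $\hat{M}_V$ by solving for all frames f:

$$\begin{bmatrix} \hat{m}_f^T \\ \hat{n}_f^T \end{bmatrix} = (A_f^T)^{-1} \begin{bmatrix} \tilde{m}_f^T \\ \tilde{n}_f^T \end{bmatrix}_{2 \times 3}.$$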
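Steps 0 and 3 of the generalized algorithm amount to applying $A_f^T$ per frame
and undoing it afterwards. The sketch below illustrates this on noise-free
synthetic data; the deformation matrices `A` are random placeholders, and the
intermediate Steps 1 and 2 are skipped by using the ground-truth motion
directly.

```python
import numpy as np

rng = np.random.default_rng(2)
F, P, R = 7, 9, 3
M_U, M_V = rng.normal(size=(F, R)), rng.normal(size=(F, R))
S = rng.normal(size=(R, P))
U, V = M_U @ S, M_V @ S
A = rng.normal(size=(F, 2, 2))              # hypothetical per-frame deformations A_f
At = A.transpose(0, 2, 1)                   # A_f^T for every frame

# Step 0: [u~ ; v~] = A_f^T [u ; v] for every frame f and point p.
UV = np.stack([U, V], axis=1)               # (F, 2, P)
UV_t = At @ UV
U_t, V_t = UV_t[:, 0], UV_t[:, 1]

# The vertically stacked [U~ ; V~] keeps rank <= R, so [U~ | V~] has rank <= 2R.
rank_stacked = np.linalg.matrix_rank(np.vstack([U_t, V_t]))

# Motion rows in the deformed coordinates: [m~^T ; n~^T] = A_f^T [m^T ; n^T].
MN = np.stack([M_U, M_V], axis=1)           # (F, 2, R)
MN_t = At @ MN

# Step 3: undo the deformation frame by frame with (A_f^T)^{-1}.
MN_rec = np.linalg.solve(At, MN_t)
```

Solving with `np.linalg.solve` rather than forming explicit inverses is a design
choice of this sketch; either is fine for 2×2 systems.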



5.2 Choosing the Matrices Af and Cp 



Given a collection of inverse covariance matrices
$\{Q_{fp} \mid f = 1 \cdots F,\ p = 1 \cdots P\}$, Eq. (7) is not guaranteed to
hold. However, we will look for the optimal collection of matrices
$\{A_f \mid f = 1 \cdots F\}$ and $\{C_p \mid p = 1 \cdots P\}$ such that the
error $\sum_{f,p} \| C_{fp} - A_f C_p \|$ is minimized (where
$C_{fp} C_{fp}^T = Q_{fp}$). These matrices $\{A_f\}$ and $\{C_p\}$ can then be
used in the generalized factorization algorithm of Section 5.1.

Let $\Gamma$ be a 2F × 2P matrix which contains all the individual 2×2 matrices
$\{C_{fp} \mid f = 1 \cdots F,\ p = 1 \cdots P\}$:

$$\Gamma = \begin{bmatrix} C_{11} & \cdots & C_{1P} \\ \vdots & & \vdots \\ C_{F1} & \cdots & C_{FP} \end{bmatrix}_{2F \times 2P}$$

When all the $C_{fp}$'s do satisfy Eq. (7), then the rank of $\Gamma$ is 2, and
it can be factored into the following two rank-2 matrices:

$$\Gamma = \begin{bmatrix} A_1 \\ \vdots \\ A_F \end{bmatrix}_{2F \times 2} [\, C_1 \mid \cdots \mid C_P \,]_{2 \times 2P}.$$

When the entries of $\Gamma$ (the matrices $\{C_{fp}\}$) do not exactly satisfy
Eq. (7), then we recover an optimal set of $\{A_f\}$ and $\{C_p\}$ (and hence
$C_{fp} = A_f C_p$), by applying SVD to the 2F × 2P matrix $\Gamma$, and setting
to zero all but the two highest singular values. Note that $\{A_f\}$ and
$\{C_p\}$ are determined only up to a global 2×2 affine transformation.
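The rank-2 factorization of the stacked matrix can be sketched as follows. The
function name `fit_deformations` and the `(F, P, 2, 2)` array layout are
assumptions; splitting the singular values evenly between the two factors is one
arbitrary way of fixing the global 2×2 ambiguity.

```python
import numpy as np

def fit_deformations(C_fp):
    """Fit C_fp ~ A_f C_p from an (F, P, 2, 2) array via a rank-2 SVD of the stacked matrix."""
    F, P = C_fp.shape[:2]
    # Stack so that gamma[2f:2f+2, 2p:2p+2] equals C_fp[f, p].
    gamma = C_fp.transpose(0, 2, 1, 3).reshape(2 * F, 2 * P)
    X, sv, Yt = np.linalg.svd(gamma, full_matrices=False)
    root = np.sqrt(sv[:2])
    A = (X[:, :2] * root).reshape(F, 2, 2)                            # A_f blocks
    C = (root[:, None] * Yt[:2]).reshape(2, P, 2).transpose(1, 0, 2)  # C_p blocks
    return A, C

# Exactly separable synthetic input: C_fp = A_f C_p (random placeholder factors).
rng = np.random.default_rng(3)
F, P = 5, 8
A0, C0 = rng.normal(size=(F, 2, 2)), rng.normal(size=(P, 2, 2))
C_fp = np.einsum('fik,pkj->fpij', A0, C0)

A_hat, C_hat = fit_deformations(C_fp)
C_fp_hat = np.einsum('fik,pkj->fpij', A_hat, C_hat)
```

When the input is exactly separable, `C_fp_hat` reproduces `C_fp`; otherwise it
is the closest separable approximation in the Frobenius norm.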



6 Experimental Results 

This section describes our experimental evaluation of the covariance-weighted
factorization algorithm described in this paper. In particular, we demonstrate
two key properties of this algorithm: (i) that its factorization of multi-frame
position data into shape and motion is accurate regardless of the degree of
ellipticity in the uncertainty of the data - i.e., whether the data consists of
“corner-like” points, “line-like” points (i.e., points that lie on linear image
structures), or both, and (ii) that in particular, the shape recovery is
completely unhampered even when the positional uncertainty of a feature point
along one direction is very large (even infinite, such as in the direction of
pure normal flow). We also contrast its performance with two “bench-marks” -
regular SVD (with no uncertainty taken into account; see Section 2.1) and
scalar-weighted SVD, which allows a scalar uncertainty (see Section 2.2). We
performed experiments with synthetically generated data, in order to obtain a
quantitative comparison of the different methods against ground truth under
varying conditions.

In our experiments, we randomly generated 3D points and affine motion 
matrices to create ground-truth positional data of multiple features in multiple 
frames. We then added elliptic Gaussian noise to this data. We varied the ellip- 
ticity of the noise to go gradually from being fully circular to highly elliptic, up 
to the extreme case when the uncertainty at each point is infinite in one of the 
directions. 

Specifically, we varied the shape of the uncertainty ellipse by varying the
parameter $r_\lambda = \sqrt{\lambda_{max} / \lambda_{min}}$, where
$\lambda_{max}$ and $\lambda_{min}$ correspond to the major and minor axes of
the uncertainty ellipse (these are the eigenvalues of the covariance matrix of
the noise in feature positions). In the first set of experiments, the same
value $r_\lambda$ was used for all the points for a given run of the experiment.
The orientation of the ellipse for each point was chosen independently at
random. In addition, we included a set of trials in which $\lambda_{min} = 0$
($r_\lambda = \infty$) for all the points. This corresponds to the case when
only “normal flow” information is available (i.e., infinite uncertainty along
the tangential direction).

We ran 20 trials for each setting of the parameter $r_\lambda$. For each trial of our 
experiment, we randomly created a cloud of 100 3D-points, with uniformly 
distributed coordinates. This defined the ground-truth shape matrix S. We ran- 
domly created 20 affine motion matrices, which together define the ground-truth 
motion matrix M. The affine motion matrices were used to project each of the 
100 points into the different views, to generate the noiseless feature positions. 

For each trial run of the experiment, for each point in our input dataset, we
randomly generated image positional noise $\varepsilon_{fp}$ with directional
uncertainty as specified above. The noise in the direction of $\lambda_{max}$
(the least uncertain direction) varied between 1% and 2% of the feature
positions, whereas the noise in the direction of $\lambda_{min}$ (the most
uncertain direction) varied between 1% and 30% of the feature positions. This
noise vector was added to the true position vector $(u_{fp}\ v_{fp})^T$ to
create the noisy input matrices U and V.
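Elliptic Gaussian noise of this kind can be generated as in the sketch below.
It assumes the reading of Section 3 in which $\lambda_{max}$ and $\lambda_{min}$
are eigenvalues of the inverse covariance, so that $r_\lambda$ is the ratio of
the larger to the smaller noise standard deviation; the function names and
numeric values are hypothetical, and the degenerate case $r_\lambda = \infty$ is
not covered.

```python
import numpy as np

def make_uncertainty(rng, P, sigma_small, r):
    """Random ellipse orientation per point; std sigma_small and r * sigma_small."""
    theta = rng.uniform(0.0, np.pi, size=P)
    # omega[p] is a rotation; column 0 = most certain direction, column 1 = least.
    omega = np.stack([np.stack([np.cos(theta), -np.sin(theta)], axis=-1),
                      np.stack([np.sin(theta), np.cos(theta)], axis=-1)], axis=-2)
    std = np.array([sigma_small, r * sigma_small])
    cov = (omega * std ** 2) @ omega.transpose(0, 2, 1)    # Omega diag(std^2) Omega^T
    Q = (omega / std ** 2) @ omega.transpose(0, 2, 1)      # inverse covariance
    return omega, std, cov, Q

def sample_noise(rng, omega, std, F):
    """Draw an (F, P, 2) array of noise vectors with the covariances defined above."""
    z = rng.normal(size=(F, omega.shape[0], 2)) * std      # independent components
    return np.einsum('pij,fpj->fpi', omega, z)             # rotate into image axes

rng = np.random.default_rng(4)
F, P, r = 20, 100, 20.0
omega, std, cov, Q = make_uncertainty(rng, P, sigma_small=0.01, r=r)
eps = sample_noise(rng, omega, std, F)
ratio = np.sqrt(np.linalg.eigvalsh(Q)[:, 1] / np.linalg.eigvalsh(Q)[:, 0])
```

The array `Q` can be passed through an eigen-decomposition to obtain the
per-point factors $C_p$ used by the covariance-weighted algorithm.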

The noisy input data was then fed to three algorithms: the covariance-weighted
factorization algorithm described in this paper, the regular SVD algorithm, and
the scalar-weighted SVD algorithm, for which the scalar weight at each point
was chosen to be equal to $\sqrt{\lambda_{max} \cdot \lambda_{min}}$ (which is
equivalent to taking the determinant of the matrix $C_{fp}$ at each point). Each
algorithm outputs a shape matrix $\hat{S}$ and a motion matrix $\hat{M}$. These
matrices were then compared against the ground-truth matrices S and M through a
shape error $e_S$ and a motion error $e_M$, computed on $\hat{S}_N$ and
$\hat{M}_N$, i.e., $\hat{S}$ and $\hat{M}$ after transforming them to be in the
same coordinate system as S and M. These errors were then averaged over the 20
trials for each setting of the parameter $r_\lambda$.

Fig. 2. Plots of error in motion and shape w.r.t. ground truth for all three
algorithms (covariance-weighted SVD, scalar-weighted SVD, regular SVD). (a,b)
Plots for the case when all points have the same elliptical uncertainty, which
is gradually increased (a = motion error, b = shape error). (c,d) Plots for the
case when half of the points have fixed circular uncertainty, and the other half
have varying elliptical uncertainty (c = motion error, d = shape error). The
displayed shape error in this case is the computed error for the group of
elliptic points (the “bad” points).

Fig. 2a and 2b display the errors in the recovered motion and shape for all
three algorithms as a function of the degree of ellipticity in the uncertainty
$r_\lambda = \sqrt{\lambda_{max} / \lambda_{min}}$. In this particular case, the
behavior of regular SVD and scalar-weighted SVD is very similar, because all
points within a single trial (for a particular finite $r_\lambda$) have the same
confidence (i.e., the same scalar weight). Note how the error in the recovered
shape and motion increases rapidly for the regular SVD and for the
scalar-weighted SVD, while the covariance-weighted SVD consistently retains very
high accuracy (i.e., very small error) in the recovered shape and motion. The
error is kept low and uniform even when the elliptical uncertainty is infinite
($r_\lambda = \infty$; i.e., when only normal-flow information is available).
This point is out of the displayed range of this graph, but is visually
displayed (for a similar experiment) in Fig. 3.

In the second set of experiments, we divided the input set of points into two
equal subsets of points. For one subset, we maintained a circular uncertainty
through all the runs (i.e., for those points $r_\lambda = 1$), while for the
other subset we gradually varied the shape of the ellipse in the same manner as
in the previous experiment above (i.e., for those points $r_\lambda$ is varied
from 1 to $\infty$). In this case, the quality of the reconstructed motion for
the scalar-weighted SVD showed comparable results (although still inferior) to
the covariance-weighted SVD (see Fig. 2c), and significantly better results than
the regular SVD. The reason for this behavior is that “good” points (with
$r_\lambda = 1$) are weighted highly in the scalar-weighted SVD (as opposed to
the regular SVD, where all points are weighted equally). However, while the
recovered shape of the circularly symmetric (“good”) points is quite accurate
and degrades gracefully with noise, the error in shape for the “bad” elliptical
points (points with large $r_\lambda$) increases rapidly with the increase of
$r_\lambda$, both in the scalar-weighted SVD and in the regular SVD. The error
in shape for this group of points (i.e., half of the total number of points) is
shown in Fig. 2d. Note how, in contrast, the covariance-weighted SVD maintains
high quality of reconstruction both in motion and in shape.

In order to visualize the results (i.e., visually compare the shape
reconstructed by the different algorithms for different types of noise), we
repeated these experiments, but this time instead of applying it to a random
shape, we applied it to a well defined shape - a cube. We used randomly
generated affine motion matrices to determine the positions of 726 cube points
in 20 different views, then corrupted them with random noise as before. Sample
displays of the reconstructed cube by the covariance-weighted algorithm vs. the
regular SVD algorithm are shown in Fig. 3 for three interesting cases: the case
of circular Gaussian noise $r_\lambda = 1$ for all the points (Figs. 3a and 3d),
the case of elliptic Gaussian noise with $r_\lambda = 20$ (Figs. 3b and 3e), and
the case of pure “normal flow”, when $\lambda_{min} = 0$ ($r_\lambda = \infty$)
(Figs. 3c and 3f). (For visibility sake, only 3 sides of the cube are
displayed.) The covariance-weighted SVD (top row) consistently maintains high
accuracy of shape recovery, even in the case of pure normal flow. The shape
reconstruction obtained by regular SVD (bottom row), on the other hand, degrades
severely with the increase in the degree of elliptical uncertainty.
Scalar-weighted SVD reconstruction was not added here, because when all the
points are equally reliable, then scalar-weighted SVD coincides with regular SVD
(see Fig. 2b), yet it is not defined for the case of infinite uncertainty
(because then all the weights are equal to zero).

7 Conclusion 

In this paper we have introduced a new algorithm for performing
covariance-weighted factorization of multiframe correspondence data into shape
and motion. Unlike the regular SVD algorithms which minimize the Frobenius norm
error in the data, or the scalar-weighted SVD which minimizes a scalar-weighted
version of that norm, our algorithm minimizes the covariance-weighted error
(or the Mahalanobis distance). This is the proper measure to minimize when
the uncertainty in feature position is directional. Our algorithm transforms the
raw input data into a covariance-weighted data space, and applies SVD in this
transformed data space, where the Frobenius norm now minimizes a meaningful
objective function. This SVD step projects the covariance-weighted data to
a 2R-dimensional subspace. We complete the process with an additional linear
estimation step to recover the rank-R shape and motion estimates.

A fundamental advantage of our algorithm is that it can handle input data
with any level of ellipticity in the directional uncertainty - i.e., from purely
circular uncertainty to highly elliptical uncertainty, even including the case
of points along lines where the uncertainty along the line direction is
infinite. It can also simultaneously use data which contains points with
different levels of directional uncertainty. We empirically show that our
algorithm recovers shape and motion accurately, even when the more conventional
SVD algorithms perform poorly. However, our algorithm cannot handle arbitrary
changes in the uncertainty of a single feature over multiple frames (views). It
can only account for frame-dependent 2D affine deformations in the covariance
matrices.

Fig. 3. Reconstructed shape of the cube by the covariance-weighted SVD (top row)
vs. the regular SVD (bottom row). For visibility sake, only 3 sides of the cube
are displayed. (a,d) case of circularly symmetric noise. (b,e) case of
elliptical noise with ratio $r_\lambda = 20$. (c,f) case of pure “normal flow”
(only line-like features) $r_\lambda = \infty$. Note that the quality of shape
reconstruction of the covariance-weighted factorization method does not degrade
with the increase in the degree of ellipticity, while in the case of regular
SVD, it degrades rapidly.
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Abstract. The Singular Value Decomposition (SVD) of a matrix is a 
linear algebra tool that has been successfully applied to a wide variety 
of domains. The present paper is concerned with the problem of estima- 
ting the Jacobian of the SVD components of a matrix with respect to 
the matrix itself. An exact analytic technique is developed that facili- 
tates the estimation of the Jacobian using calculations based on simple 
linear algebra. Knowledge of the Jacobian of the SVD is very useful in 
certain applications involving multivariate regression or the computation 
of the uncertainty related to estimates obtained through the SVD. The 
usefulness and generality of the proposed technique is demonstrated by 
applying it to the estimation of the uncertainty for three different vision 
problems, namely self-calibration, epipole computation and rigid motion 
estimation. 



1 Introduction and Motivation 

The SVD is a general linear algebra technique that is of utmost importance 
for several computations involving matrices. For example, some of the uses of 
SVD include its application to solving ordinary and generalized least squares 
problems, computing the pseudo-inverse of a matrix, assessing the sensitivity 
of linear systems, determining the numerical rank of a matrix, carrying out 
multivariate analysis and performing operations such as rotation, intersection, 
and distance determination on linear subspaces [?]. Owing to its power and 
flexibility, the SVD has been successfully applied to a wide variety of domains, 
from which a few sample applications are briefly described next. Zhang et al. [?], 
for example, employ the SVD to develop a fast image correlation scheme. The 
problem of establishing correspondences is also addressed by Jones and Malik 
[?], who compare feature vectors defined by the responses of several spatial 
filters and use the SVD to determine the degree to which the chosen filters 
are independent of each other. Structure from motion is another application 
area that has greatly benefited from the SVD. Longuet-Higgins [?] and later 



D. Vernon (Ed.): ECCV 2000, LNCS 1842, pp. 554-[CT3 2000. 
(c) Springer- Verlag Berlin Heidelberg 2000 



Estimating the Jacobian of the SVD 555 



Hartley [?] extract the translational and rotational components of rigid 3D 
motion using the SVD of the essential matrix. Tsai et al. [?] also use the SVD 
to recover the rigid motion of a 3D planar patch. Kanade and co-workers [?] 
assume special image formation models and use the SVD to factorize 
image displacements into structure and motion components. Using an SVD-based 
method, Sturm and Triggs [34] recover projective structure and motion from 
uncalibrated images and thus extend the work of [35,28] to the case of perspective 
projection. The SVD of the fundamental matrix yields a simplified form of the 
Kruppa equations, on which self-calibration is based [?]. Additionally, 
the SVD is used to deal with important image processing problems such as 
noise estimation [?], image coding [?] and image watermarking [?]. Several 
parametric fitting problems involving linear least squares estimation can also be 
effectively resolved with the aid of the SVD [?]. Finally, the latter has also 
proven useful in signal processing applications [?] and pattern recognition 
techniques such as neural network computing [?] and principal component 
analysis [?]. 

This paper deals with the problem of computing the Jacobian of the SVD 
components of a matrix with respect to the elements of this matrix. Knowledge 
of this Jacobian is important as it is a key ingredient in tasks such as non-linear 
optimization and error propagation: 

— Several optimization methods require that the Jacobian of the criterion that 
is to be optimized is known. This is especially true in the case of complicated 
criteria. When these criteria involve the SVD, the method proposed in this 
paper is invaluable for providing analytical estimates of their Jacobians. As 
will be further explained later, numerical computation of such Jacobians 
using finite differences is not as straightforward as it might seem at first 
glance. 

— Computation of the covariance matrix corresponding to some estimated 
quantity requires knowledge of the Jacobians of all functions involved in the 
estimation of the quantity in question. Considering that the SVD is quite 
common in many estimation problems in vision, the method proposed in 
this work can be used in these cases for computing the covariance matrices 
associated with the estimated objects. 



Paradoxically, the numerical analysis literature provides little help on this 
topic. Indeed, many studies have examined the sensitivity of singular 
values and singular vectors to perturbations in the original matrix [?], 
but these consider perturbations of the input matrix as a whole and 
derive bounds for the singular elements; they do not deal with perturbations due 
to individual elements. 

Thus, the method proposed here fills in an important gap, since, to the best 
of our knowledge, no similar method for SVD differentiation appears in the 
literature. The rest of this paper is organized as follows. Section 2 gives an 
analytical derivation for the computation of the Jacobian of the SVD and discusses 
practical issues related to its implementation in degenerate cases. Section 3 
illustrates the use of the proposed technique with three examples of covariance 
matrix estimation. The paper concludes with a brief discussion in Section 4. 



2 The Proposed Method 

2.1 Notation and Background 

In the rest of the paper, bold letters will be used for denoting vectors and matrices. 
The transpose of matrix $M$ is denoted by $M^T$ and $m_{ij}$ refers to the $(i, j)$ element 
of $M$. The $i$-th non-zero element of a diagonal matrix $D$ is referred to by $d_i$, 
while $M_i$ designates the $i$-th column of matrix $M$. 

A basic theorem of linear algebra states that any real $M \times N$ matrix $A$ with 
$M \geq N$ can be written as the product of an $M \times N$ column-orthogonal matrix 
$U$, an $N \times N$ diagonal matrix $D$ with non-negative diagonal elements (known as 
the singular values), and the transpose of an $N \times N$ orthogonal matrix $V$ [?]. 
In other words, 

$$A = U D V^T = \sum_{i=1}^{N} d_i U_i V_i^T. \qquad (1)$$

The singular values are the square roots of the eigenvalues of the matrix 
$A A^T$ (or $A^T A$, since these matrices share the same non-zero eigenvalues), while 
the columns of $U$ and $V$ (the singular vectors) correspond to the eigenvectors of 
$A A^T$ and $A^T A$ respectively. As defined in Eq. (1), the SVD is not unique, 
since 

— it is invariant to arbitrary permutations of the singular values and their 
corresponding left and right singular vectors. Sorting the singular values 
(usually in decreasing order of magnitude) solves this problem unless there 
exist equal singular values. 

— simultaneous changes in the signs of the vectors $U_i$ and $V_i$ do not have any 
impact on the leftmost part of Eq. (1). In practice, this has no impact on 
most numerical computations involving the SVD. 
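
The rank-one expansion of Eq. (1) and the sign ambiguity noted above can be checked numerically. The following is a minimal NumPy sketch; the matrix `A` is an arbitrary example chosen here for illustration and is not data from the paper.

```python
import numpy as np

# Arbitrary 3 x 2 example matrix (an assumption made for illustration).
A = np.array([[3.0, 1.0], [1.0, 2.0], [0.0, 1.0]])

# Thin SVD: U is 3 x 2 (column orthogonal), d holds the singular values,
# Vt is the transpose of the 2 x 2 orthogonal matrix V.
U, d, Vt = np.linalg.svd(A, full_matrices=False)

# Eq. (1): A written as a sum of rank-one terms d_i * U_i * V_i^T.
rank_one_sum = sum(d[i] * np.outer(U[:, i], Vt[i]) for i in range(len(d)))

# Flip the signs of U_1 and V_1 simultaneously: the product is unchanged.
U_flipped, Vt_flipped = U.copy(), Vt.copy()
U_flipped[:, 0] *= -1.0
Vt_flipped[0] *= -1.0
A_flipped = U_flipped @ np.diag(d) @ Vt_flipped
```

Both `rank_one_sum` and `A_flipped` reproduce `A` up to rounding, although `U_flipped` differs from `U`.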



2.2 Computing the Jacobian of the SVD 

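Employing the definitions of Section 2.1, we are interested in computing 
$\frac{\partial U}{\partial a_{ij}}$, $\frac{\partial V}{\partial a_{ij}}$ and $\frac{\partial D}{\partial a_{ij}}$ for every element $a_{ij}$ of the $M \times N$ matrix $A$. 

Taking the derivative of Eq. (1) with respect to $a_{ij}$ yields the following equation: 

$$\frac{\partial A}{\partial a_{ij}} = \frac{\partial U}{\partial a_{ij}} D V^T + U \frac{\partial D}{\partial a_{ij}} V^T + U D \frac{\partial V^T}{\partial a_{ij}}. \qquad (2)$$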



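Clearly, $\forall (k,l) \neq (i,j)$, $\frac{\partial a_{kl}}{\partial a_{ij}} = 0$, while $\frac{\partial a_{ij}}{\partial a_{ij}} = 1$. Since $U$ is an orthogonal 
matrix, we have the following: 

$$U^T U = I \;\Longrightarrow\; \frac{\partial U^T}{\partial a_{ij}} U + U^T \frac{\partial U}{\partial a_{ij}} = {\Omega_U^{ij}}^T + \Omega_U^{ij} = 0, \qquad (3)$$

where $\Omega_U^{ij}$ is given by 

$$\Omega_U^{ij} = U^T \frac{\partial U}{\partial a_{ij}}. \qquad (4)$$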



From Eq. (3) it is clear that $\Omega_U^{ij}$ is an antisymmetric matrix. Similarly, an 
antisymmetric matrix $\Omega_V^{ij}$ can be defined for $V$ as 

$$\Omega_V^{ij} = \frac{\partial V^T}{\partial a_{ij}} V. \qquad (5)$$






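Notice that $\Omega_U^{ij}$ and $\Omega_V^{ij}$ are specific to each differentiation $\frac{\partial}{\partial a_{ij}}$. 

By multiplying Eq. (2) by $U^T$ and $V$ from the left and right respectively, 
and using Eqs. (4) and (5), the following relation is obtained: 

$$U^T \frac{\partial A}{\partial a_{ij}} V = \Omega_U^{ij} D + \frac{\partial D}{\partial a_{ij}} + D\, \Omega_V^{ij}. \qquad (6)$$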



Since $\Omega_U^{ij}$ and $\Omega_V^{ij}$ are antisymmetric matrices, all their diagonal elements 
are equal to zero. Recalling that $D$ is a diagonal matrix, it is easy to see that 
the diagonal elements of $\Omega_U^{ij} D$ and $D\, \Omega_V^{ij}$ are also zero. Thus, Eq. (6) yields the 
derivatives of the singular values as 

$$\frac{\partial d_k}{\partial a_{ij}} = u_{ik} v_{jk}. \qquad (7)$$



Taking into account the antisymmetry property, the elements of the matrices 
$\Omega_U^{ij}$ and $\Omega_V^{ij}$ can be computed by solving a set of $2 \times 2$ linear systems, which are 
derived from the off-diagonal elements of the matrices in Eq. (6): 

$$\begin{aligned}
d_l\, \Omega_{U\,kl}^{ij} + d_k\, \Omega_{V\,kl}^{ij} &= u_{ik} v_{jl} \\
d_k\, \Omega_{U\,kl}^{ij} + d_l\, \Omega_{V\,kl}^{ij} &= -u_{il} v_{jk} ,
\end{aligned} \qquad (8)$$

where the index ranges are $k = 1 \ldots N$ and $l = k + 1 \ldots N$. Note that, since 
the $d_k$ are positive numbers, this system has a unique solution provided that 
$d_k \neq d_l$. Assuming for the moment that $\forall (k, l)$, $d_k \neq d_l$, the parameters 
defining the non-zero elements of $\Omega_U^{ij}$ and $\Omega_V^{ij}$ can be easily recovered by solving 
the corresponding $2 \times 2$ linear systems. 

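Once $\Omega_U^{ij}$ and $\Omega_V^{ij}$ have been computed, $\frac{\partial U}{\partial a_{ij}}$ and $\frac{\partial V}{\partial a_{ij}}$ follow as: 

$$\frac{\partial U}{\partial a_{ij}} = U\, \Omega_U^{ij}, \qquad \frac{\partial V}{\partial a_{ij}} = -V\, \Omega_V^{ij}. \qquad (9)$$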



In summary, the desired derivatives are supplied by Eqs. (7) and (9). 
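
The derivation of this section can be sketched in a few lines of NumPy. The code below is an illustrative sketch only, not the authors' implementation: it assumes a square matrix with distinct singular values, and the function name `svd_jacobian` and the array layout of its outputs are choices made here for the example.

```python
import numpy as np

def svd_jacobian(A):
    """Derivatives of U, d and V with respect to every element a_ij.

    Assumes a square A with distinct singular values (non-degenerate case).
    Returns U, d, V and arrays with dU[i, j] = dU/da_ij, dd[i, j] = dd/da_ij,
    dV[i, j] = dV/da_ij.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    U, d, Vt = np.linalg.svd(A)
    V = Vt.T
    dU = np.zeros((n, n, n, n))
    dV = np.zeros((n, n, n, n))
    dd = np.zeros((n, n, n))
    for i in range(n):
        for j in range(n):
            # Singular value derivatives: dd_k/da_ij = u_ik * v_jk.
            dd[i, j] = U[i, :] * V[j, :]
            om_u = np.zeros((n, n))
            om_v = np.zeros((n, n))
            for k in range(n):
                for l in range(k + 1, n):
                    # 2 x 2 system for the pair (k, l) of singular values.
                    M = np.array([[d[l], d[k]], [d[k], d[l]]])
                    rhs = np.array([U[i, k] * V[j, l], -U[i, l] * V[j, k]])
                    om_u[k, l], om_v[k, l] = np.linalg.solve(M, rhs)
                    # Antisymmetry fixes the lower triangle.
                    om_u[l, k], om_v[l, k] = -om_u[k, l], -om_v[k, l]
            dU[i, j] = U @ om_u       # dU/da_ij = U * Omega_U
            dV[i, j] = -V @ om_v      # dV/da_ij = -V * Omega_V
    return U, d, V, dU, dd, dV
```

A convenient sanity check is to substitute the returned derivatives back into the product rule for $A = U D V^T$: for each $(i, j)$ the result should be the matrix with a single one in row $i$, column $j$.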



2.3 Implementation and Practical Issues 

In this section, a few implementation issues related to a practical application of 
the proposed method are considered. 
Degenerate SVDs. Until now, the case where the SVD yields at least two 
identical singular values has been set aside. However, such cases occur often in 
practice, for example when dealing with the essential matrix (see Section 3.3 
ahead) or when fitting a line in 3D. Therefore, let us now assume that $d_k = d_l$ 
for some $k$ and $l$. It is easy to see that these two singular values contribute to 
Eq. (1) with the term $d_l (U_k V_k^T + U_l V_l^T)$. The same contribution to $A$ can 
be obtained by using any other orthonormal bases of the subspaces spanned by 
$(U_k, U_l)$ and $(V_k, V_l)$ respectively. Therefore, letting 

$$\begin{aligned}
U'_k &= \cos\alpha\, U_k + \sin\alpha\, U_l \\
U'_l &= -\sin\alpha\, U_k + \cos\alpha\, U_l \\
V'_k &= \cos\alpha\, V_k + \sin\alpha\, V_l \\
V'_l &= -\sin\alpha\, V_k + \cos\alpha\, V_l ,
\end{aligned}$$

for any real number $\alpha$, we have $U_k V_k^T + U_l V_l^T = U'_k {V'_k}^T + U'_l {V'_l}^T$. This implies 
that in this case, there exists a one-dimensional family of SVDs. Consequently, 
the $2 \times 2$ system of Eqs. (8) must be solved in a least squares fashion in order to 
get only the component of the Jacobian that is “orthogonal” to this family. Of 
course, when more than two singular values are equal, all the corresponding $2 \times 2$ 
systems have to be solved simultaneously. The correct algorithm for all cases is 
thus: 

— Group together all the 2x2 systems corresponding to equal singular values. 

— Solve these systems using least squares. 

This will give the exact Jacobian in non-degenerate cases and the “minimum 
norm” Jacobian when one or more singular values are equal. 
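
As a rough illustration of the least squares step for a single pair of singular values, the sketch below solves one $2 \times 2$ system with NumPy's `lstsq`, which returns the minimum norm solution when the system is rank deficient. The helper name `solve_pair` and the cutoff `rcond=1e-12` are assumptions made for this example; the grouping of several equal singular values is not shown.

```python
import numpy as np

def solve_pair(d_k, d_l, rhs):
    """Solve one 2 x 2 system of Eq. (8) for (Omega_U_kl, Omega_V_kl).

    When d_k != d_l this is the exact solution; when d_k == d_l the matrix
    is singular and lstsq returns the minimum norm least squares solution.
    """
    M = np.array([[d_l, d_k], [d_k, d_l]], dtype=float)
    sol, *_ = np.linalg.lstsq(M, np.asarray(rhs, dtype=float), rcond=1e-12)
    return sol
```

For example, with $d_k = d_l = 2$ and right-hand side $(1, 3)$ the two unknowns come out equal, which is the minimum norm choice along the one-dimensional family described above.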



Computational complexity. Assuming that the matrix $A$ is $N \times N$ and 
non-degenerate for simplicity, it is easy to compute the complexity of the procedure 
for computing the Jacobian: for each element $a_{ij}$, $i = 1 \ldots N$, $j = 1 \ldots N$, a 
total of $N(N-1)/2$ $2 \times 2$ linear systems have to be solved. In essence, the complexity 
of the method is $O(N^4)$ once the initial SVD has been carried out. 



Computing the Jacobian using finite differences. The proposed method 
has been compared with a finite difference approximation of the Jacobian and 
the same results have been obtained in the non-degenerate case (degenerate cases 
are more difficult to compare due to the non-uniqueness of the SVD). Although 
the ease of implementation makes the finite difference approximation more 
appealing for computing the Jacobian, the following points should also be taken 
into account: 

— The finite difference method is more costly in terms of computational 
complexity. Considering again the case of an $N \times N$ non-degenerate matrix as 
in the previous paragraph, it is simple to see that such an approach requires 
$N^2$ SVD computations to be performed (i.e. one for each perturbation of 
each element of the matrix). Since the complexity of the SVD operation is 
$O(N^3)$, the overall complexity of the approach is $O(N^5)$, which is an order of 
magnitude higher than that of the proposed method. 

— Actually, the implementation of a finite difference approximation to a 
Jacobian is not as simple as it might appear. This is because even state of the 
art algorithms for SVD computation (e.g. LAPACK’s dgesvd family of routines 
[?]) are “unstable” with respect to small perturbations of the input. By 
unstable, we mean that the signs associated with the columns $U_i$ and $V_i$ 
can change arbitrarily even with the slightest perturbation. In general, this 
is not important but it has strong effects in our case since the original and 
perturbed SVD do not return the same objects. Consequently, care has to be 
taken to compensate for this effect when the Jacobian is computed through 
finite differences. 
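
One possible way to compensate for the sign ambiguity is sketched below: the singular vectors of the perturbed matrix are aligned with those of the unperturbed one before differencing. This is an assumption about how such a compensation could be done, not a description of the authors' code; the names `aligned_svd` and `fd_dU` and the step `h` are hypothetical, and distinct singular values are assumed.

```python
import numpy as np

def aligned_svd(A, U_ref=None):
    """SVD whose column signs are matched to a reference U, if one is given."""
    U, d, Vt = np.linalg.svd(A)
    V = Vt.T
    if U_ref is not None:
        # Sign of the inner product between each new and reference column.
        signs = np.sign(np.sum(U * U_ref, axis=0))
        signs[signs == 0] = 1.0
        # Flip U_i and V_i together so that U D V^T is unchanged.
        U = U * signs
        V = V * signs
    return U, d, V

def fd_dU(A, i, j, h=1e-6):
    """Forward finite difference approximation of dU/da_ij."""
    U0, _, _ = aligned_svd(A)
    A_pert = np.array(A, dtype=float)
    A_pert[i, j] += h
    U1, _, _ = aligned_svd(A_pert, U_ref=U0)
    return (U1 - U0) / h
```

Without the alignment, a sign flip in a single column would produce entries of order $2/h$ in the difference quotient instead of a derivative.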



3 Applications 



In this section, the usefulness and generality of the proposed SVD differentiation 
method are demonstrated by applying it to three important vision problems. 
Before proceeding to the description of each of these problems, we briefly state 
a theorem related to error propagation that is essential for the developments in 
the subsections that follow. More specifically, let $x_0 \in \mathbb{R}^N$ be a measurement 
vector, from which a vector $y_0 \in \mathbb{R}^M$ is computed through a function $f$, i.e. 
$y_0 = f(x_0)$. Here, we are interested in determining the uncertainty of $y_0$, given 
the uncertainty of $x_0$. Let $x \in \mathbb{R}^N$ be a random vector with mean $x_0$ and 
covariance $\Lambda_x = E[(x - x_0)(x - x_0)^T]$. The vector $y = f(x)$ is also random and 
its covariance $\Lambda_y$ up to first order is equal to 

$$\Lambda_y = \frac{\partial f(x_0)}{\partial x_0}\, \Lambda_x\, \frac{\partial f(x_0)}{\partial x_0}^T, \qquad (10)$$

where $\frac{\partial f(x_0)}{\partial x_0}$ is the derivative of $f$ at $x_0$. For more details and proof, the reader 
is referred to [?]. In the following, Eq. (10) will be used for computing the 
uncertainty pertaining to various entities that are estimated from images. Since image 
measurements are always corrupted by noise, the estimation of the uncertainty 
related to these entities is essential for effectively and correctly employing the 
latter in subsequent computations. 
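
The first order propagation rule above amounts to a single matrix product. A minimal sketch follows; the function name `propagate_covariance` is chosen here for illustration, and the Jacobian is assumed to be supplied as an $M \times N$ array.

```python
import numpy as np

def propagate_covariance(jacobian, cov_x):
    """First order covariance of y = f(x): J * cov_x * J^T."""
    J = np.atleast_2d(np.asarray(jacobian, dtype=float))
    return J @ np.asarray(cov_x, dtype=float) @ J.T
```

For a linear map the result is exact. With $f(x) = (x_1 + x_2,\; x_1 - x_2)$ and independent inputs of variance 1 and 4, the outputs each have variance 5 and covariance $-3$.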



3.1 Self-Calibration Using the SVD of the Fundamental Matrix 

The first application that we deal with is that of self-calibration, that is the 
estimation of the camera intrinsic parameters without relying upon the existence 
of a calibration object. Instead, self-calibration employs constraints known as 
the Kruppa equations, which are derived by tracking image features through an 
image sequence. More details regarding self-calibration can be found in 
[?]. Here, we restrict our attention to a self-calibration method that is based on a 
simplification of the Kruppa equations derived from the SVD of the fundamental 
matrix. In the following paragraph, a brief description of the method is given; 
more details can be found in [20,22]. 

Let $S_F$ be the vector formed by the parameters of the SVD of the fundamental 
matrix $F$. The Kruppa equations in this case reduce to three linearly dependent 
constraints, two of which are linearly independent. Let $\pi_i(S_F, K)$, $i = 1 \ldots 3$ 
denote those three equations as functions of the fundamental matrix $F$ and the 
matrix $K = A A^T$, where $A$ is the $3 \times 3$ intrinsic calibration parameters matrix 
having the following well-known form [?]: 

$$A = \begin{bmatrix} \alpha_u & -\alpha_u \cot\theta & u_0 \\ 0 & \alpha_v / \sin\theta & v_0 \\ 0 & 0 & 1 \end{bmatrix}. \qquad (11)$$

The parameters $\alpha_u$ and $\alpha_v$ correspond to the focal distances in pixels along the 
axes of the image, $\theta$ is the angle between the two image axes and $(u_0, v_0)$ are the 
coordinates of the image principal point. In practice, $\theta$ is very close to $\pi/2$ for real 
cameras. The matrix $K$ is parameterized with the unknown intrinsic parameters 
from Eq. (11) and is computed from the solution of a non-linear least squares 
problem, namely 

$$K = \operatorname*{argmin}_{K} \sum_{i=1}^{N} \left[ \frac{\pi_1^2(S_{F_i}, K)}{\sigma_{\pi_1}^2(S_{F_i}, K)} + \frac{\pi_2^2(S_{F_i}, K)}{\sigma_{\pi_2}^2(S_{F_i}, K)} + \frac{\pi_3^2(S_{F_i}, K)}{\sigma_{\pi_3}^2(S_{F_i}, K)} \right]. \qquad (12)$$



In the above equation, $N$ is the number of the available fundamental matrices and 
$\sigma_{\pi_i}^2(S_{F_i}, K)$ are the variances of constraints $\pi_i(S_F, K)$, $i = 1 \ldots 3$, respectively, 
used to automatically weight the constraints according to their uncertainty. It is 
to the estimation of these variances that the proposed differentiation method is 
applied. More specifically, applying Eq. (10) to the case of the simplified Kruppa 
equations, it is straightforward to show that the variance of the latter is 
approximated by 

$$\sigma_{\pi_i}^2(S_F, K) = \frac{\partial \pi_i(S_F, K)}{\partial S_F}\, \frac{\partial S_F}{\partial F}\, \Lambda_F\, \frac{\partial S_F}{\partial F}^T\, \frac{\partial \pi_i(S_F, K)}{\partial S_F}^T.$$

In the above equation, $\frac{\partial \pi_i(S_F, K)}{\partial S_F}$ is the derivative of $\pi_i(S_F, K)$ at $S_F$, $\frac{\partial S_F}{\partial F}$ is the 
Jacobian of $S_F$ at $F$ and $\Lambda_F$ is the covariance of the fundamental matrix, supplied 
as a by-product of the procedure for estimating $F$ [?]. The derivative $\frac{\partial \pi_i(S_F, K)}{\partial S_F}$ is 
computed directly from the analytic expression for $\pi_i(S_F, K)$, while $\frac{\partial S_F}{\partial F}$ is 
estimated using the proposed method for SVD differentiation. To quantitatively 
assess the improvement on the accuracy of the recovered intrinsic parameters that 
is gained by employing the covariances, a set of simulations has been conducted. 
More specifically, three rigid displacements of a virtual camera were simulated 
and a set of randomly chosen 3D points were projected on the simulated retinas. 
Following this, the resulting retinal points were contaminated by zero mean 
additive Gaussian noise. The noise standard deviation was increased from 0 to 4 
[Figure 1 plots: mean and standard deviation of the a_u and a_v relative errors vs. noise standard deviation (pixels).] 

Fig. 1. The error in the recovered focal lengths in the presence of noise, with and 
without employing the covariance. Mean values are shown in the top row, standard 
deviations in the bottom. 



pixels in steps of 0.1. A non-linear method was then employed to estimate 
from the noisy retinal points the fundamental matrices corresponding to the 
simulated displacements. The estimates of the fundamental matrices serve as the 
input to self-calibration. To ensure that the recovered intrinsic calibration 
parameters are independent of the exact location of the 3D points used to form 2D 
correspondences, 100 experiments were run for each noise level, each time using 
a different random set of 3D points. More details regarding the simulation can 
be found in [20]. Figures 1 and 2 illustrate the mean and standard deviation 
of the relative error for the intrinsic parameters versus the standard deviation of 
the noise added to image points, with and without employing the covariances. 
When the covariances are not employed, the weights $\sigma_{\pi_i}^2(S_F, K)$ in Eq. (12) are 
all assumed to be equal to one. Throughout all experiments, zero skew has been 
assumed, i.e. $\theta = \pi/2$, and $K$ in Eq. (12) was parameterized using 4 unknowns. 
As is evident from the plots, especially those referring to the standard deviation 
of the relative error, the inclusion of covariances yields more accurate and more 
stable estimates of the intrinsic calibration parameters. Additional experimental 
results can be found in [?]. At this point, it is also worth mentioning that in 
the case of self-calibration, the derivatives $\frac{\partial S_F}{\partial F}$ were also computed analytically 
by using Maple to compute closed-form expressions for the SVD components 
with respect to the elements of $F$. Using the computed expressions for the SVD, 
the derivative of the latter with respect to $F$ was then computed analytically. As 
expected, the arithmetic values of the derivatives obtained in this manner were 
identical to those computed by the differentiation method proposed here. 



[Figure 2 plots: mean and standard deviation of the u_0 and v_0 relative errors vs. noise standard deviation.] 

Fig. 2. The error in the recovered principal points in the presence of noise, with and 
without employing the covariance. Mean values are shown in the top row, standard 
deviations in the bottom. 



3.2 Estimation of the Epipoles’ Uncertainty 

The epipoles of an image pair are the two image points defined by the projection 
of each camera’s optical center on the retinal plane of the other. The epipoles 
encode information related to the relative position of the two cameras and have 
been employed in applications such as stereo rectification [?], self-calibration 
[?], projective invariants estimation [?] and point features matching 
[?]. Although it is generally known that the epipoles are hard to estimate 
accurately¹ [24], the uncertainty pertaining to their estimates is rarely quantified. 
Here, a simple method is presented that permits the estimation of the 
epipoles’ covariance matrices based on the covariance of the underlying fundamental 
matrix. 

¹ This is particularly true when the epipoles lie outside the images. 

Let $e$ and $e'$ denote the epipoles in the first and second images of a stereo 
pair respectively. The epipoles can be directly estimated from the SVD of the 
fundamental matrix $F$ as follows. Assuming that $F$ is decomposed as $F = U D V^T$ 
and recalling that $F^T e' = F e = 0$, it is easy to see that $e$ corresponds to the 
third column of $V$, while $e'$ is given by the third column of $U$. The epipoles are 
thus given by two very simple functions $f_e$ and $f_{e'}$ of the vector $S_F$ defined by the 
SVD of $F$. More precisely, $f_e(S_F) = V_3$ and $f_{e'}(S_F) = U_3$, where $V_3$ and $U_3$ 
are the third columns of matrices $V$ and $U$ respectively. A direct application of 
Eq. (10) can be used for propagating the uncertainty corresponding to $F$ to the 
estimates of the epipoles. Since this derivation is analogous to that in Section 3.1, 
exact details are omitted and a study of the quality of the estimated covariances 
is presented instead. 
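
The extraction of the epipoles described above can be sketched as follows. The function name `epipoles_from_F` is chosen here for illustration, and the normalization to a unit third coordinate assumes that neither epipole lies at infinity.

```python
import numpy as np

def epipoles_from_F(F):
    """Epipoles as the null vectors of F, scaled so the third element is one."""
    U, _, Vt = np.linalg.svd(np.asarray(F, dtype=float))
    e = Vt[2]            # third column of V: F e = 0
    e_prime = U[:, 2]    # third column of U: F^T e' = 0
    return e / e[2], e_prime / e_prime[2]
```

The division by the third element also removes the sign ambiguity of the singular vectors, so the result does not depend on which sign the SVD routine happens to return.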

First, a synthetic set of corresponding pairs of 2D image points was generated. 
The simulated images were 640 x 480 pixels and the epipoles $e$ and $e'$ were within 
them, namely at pixel coordinates (458.123, 384.11) and (526, 402) respectively. 
The set of generated points was contaminated by different amounts of noise and 
then the covariances of the epipoles estimated analytically using Eq. (10) were 
compared to those computed using a statistical method which approximates 
the covariances by exploiting the law of large numbers. In simpler terms, the 
mean of a random vector $y$ can be approximated by the discrete mean of a 
sufficiently large number $N$ of samples, defined by $E_D[y] = \frac{1}{N} \sum_{i=1}^{N} y_i$, and the 
corresponding covariance by 

$$\frac{1}{N-1} \sum_{i=1}^{N} (y_i - E_D[y])(y_i - E_D[y])^T. \qquad (13)$$

Assuming additive Gaussian noise whose standard deviation increased from 0.1 
to 2.0 in increments of 0.1 pixels, the analytically computed covariance estimates 
were compared against those produced by the statistical method. In particular, 
for each level of noise $\sigma$, 1000 noise-corrupted samples of the original set of 
corresponding pairs were obtained by adding zero mean Gaussian noise with standard 
deviation $\sigma$ to the original set of corresponding pairs. Then, 1000 epipole pairs 
were computed through the estimation of the 1000 fundamental matrices 
pertaining to the 1000 noise-corrupted samples. Following this, the statistical estimates 
of the two epipole covariances were computed using Eq. (13) for $N = 1000$. To 
estimate the epipole covariances with the analytical method, the latter is applied 
to the fundamental matrix corresponding to a randomly selected sample of noisy 
pairs. 
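
The statistical estimate used as the reference can be sketched as below. The sketch assumes the unbiased $1/(N-1)$ normalization for the covariance; the function name `sample_mean_and_covariance` and the one-sample-per-row layout are choices made for this example.

```python
import numpy as np

def sample_mean_and_covariance(samples):
    """Discrete mean and covariance of N samples stored one per row."""
    Y = np.asarray(samples, dtype=float)      # shape (N, M)
    mean = Y.mean(axis=0)
    centered = Y - mean
    # Assumed normalization: 1 / (N - 1), i.e. the unbiased estimator.
    cov = centered.T @ centered / (Y.shape[0] - 1)
    return mean, cov
```

In the experiment described above, each row would hold one of the 1000 normalized epipole estimates, giving a $2 \times 2$ covariance per epipole.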

In order to facilitate both the comparison and the graphical visualization of 
the estimated covariances, the concept of the hyper-ellipsoid of uncertainty is 
introduced next. Assuming that an $M \times 1$ random vector $y$ follows a Gaussian 
distribution with mean $E[y]$ and covariance $\Lambda_y$, it is easy to see that the random 
vector $x$ defined by $x = \Lambda_y^{-1/2} (y - E[y])$ follows a Gaussian distribution of mean 
zero and of covariance equal to the $M \times M$ identity matrix $I$. This implies that 
the random variable $\delta_y$ defined as 

$$\delta_y = x^T x = (y - E[y])^T \Lambda_y^{-1} (y - E[y])$$

follows a $\chi^2$ (chi-square) distribution with $r$ degrees of freedom, $r$ being the rank 
of $\Lambda_y$ [?]. Therefore, the probability that $y$ lies within the $k$-hyper-ellipsoid 
defined by the equation 

$$(y - E[y])^T \Lambda_y^{-1} (y - E[y]) = k^2, \qquad (14)$$

is given by the cumulative probability function $P_{\chi^2}(k, r)$² [?]. 



[Figure 3 plot: estimated epipole covariances vs. noise; curves for the analytical and statistical covariances of e and e', and the ideal 75% level; horizontal axis: noise standard deviation (pixels).] 

Fig. 3. The fraction of 1000 estimates of the two epipoles that lie within the uncertainty 
ellipses defined by the corresponding covariances computed with the analytical and 
statistical method. According to the $\chi^2$ criterion, this fraction is ideally equal to 75%. 



In the following, it is assumed that the epipoles are represented using points 
in the two-dimensional Euclidean space $\mathbb{R}^2$ rather than in the embedding 
projective space $\mathbb{P}^2$. This is simply accomplished by normalizing the estimates of 
the epipoles obtained from the SVD of the fundamental matrix so that their 
third element is equal to one. Choosing the probability $P_{\chi^2}(k, r)$ of an epipole 
estimate being within the ellipsoid defined by the covariance to be equal to 0.75 
yields $k = 1.665$ for $r = 2$. Figure 3 shows the fractions of the epipole 
estimates that lie within the uncertainty ellipses defined by the covariances computed 
with the analytical and statistical method, as a function of the noise standard 
deviation. Clearly, the fractions of points within the ellipses corresponding to 
the covariances computed with the statistical method are very close to the 
theoretical 0.75. On the other hand, the estimates of the covariances computed with 
the analytical method are satisfactory, with the corresponding fractions being 
over 0.65 when the noise does not exceed 1.5 pixels. 

² The norm defined by the left-hand side of Eq. (14) is sometimes referred to as the 
Mahalanobis distance. 
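
The value $k = 1.665$ can be reproduced in closed form, because for $r = 2$ degrees of freedom the chi-square cumulative distribution is $1 - e^{-k^2/2}$. The sketch below uses this special case only; the function names are chosen here for illustration.

```python
import math
import numpy as np

def k_for_probability_2d(p):
    """k such that a 2D Gaussian sample lies in the k-ellipse with probability p."""
    # Chi-square CDF with 2 degrees of freedom: P = 1 - exp(-k^2 / 2).
    return math.sqrt(-2.0 * math.log(1.0 - p))

def inside_ellipse(y, mean, cov, k):
    """True if the squared Mahalanobis distance of y is at most k^2."""
    diff = np.asarray(y, dtype=float) - np.asarray(mean, dtype=float)
    return float(diff @ np.linalg.solve(np.asarray(cov, dtype=float), diff)) <= k * k
```

Counting how many of the sampled epipole estimates satisfy `inside_ellipse` for a given covariance gives the fractions of the kind plotted in Fig. 3.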

The differences between the covariances estimated by the analytical and 
statistical methods are shown graphically in Fig. 4 for three levels of noise, namely 
0.1, 1.0 and 2.0 pixels. Since the analytical method always underestimates the 
covariance, the corresponding ellipses are contained in the ellipses computed by 
the statistical method. Nevertheless, the shape and orientation of the ellipses 
computed with the analytical method are similar to those of the statistically 
computed ellipses. 



3.3 Estimation of the Covariance of Rigid 3D Motion 

The third application of the proposed SVD differentiation technique concerns its 
use for estimating the covariance of rigid 3D motion estimates. It is well known 
that the object encoding the translation and rotation comprising the 3D motion 
is the essential matrix $E$. Matrix $E$ is defined by $E = [T]_\times R$, where $T$ and $R$ 
represent respectively the translation vector and the rotation matrix defining 
a rigid displacement and $[T]_\times$ is the antisymmetric matrix associated with the 
cross product: 

$$[T]_\times = \begin{bmatrix} 0 & -T_3 & T_2 \\ T_3 & 0 & -T_1 \\ -T_2 & T_1 & 0 \end{bmatrix}.$$

There exist several methods for extracting estimates of the translation and 
rotation from estimates of the essential matrix. Here, we focus our attention on 
a simple linear method based on the SVD of $E$, described in [?]. Assuming 
that the SVD of $E$ is $E = U D V^T$, there exist two possible solutions for the 
rotation $R$, namely $R = U W V^T$ and $R = U W^T V^T$, where $W$ is given by 

$$W = \begin{bmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$

The translation is given by the third column of matrix $V$, that is $T = V (0, 0, 1)^T$ 
with $|T| = 1$. The two possible choices for $R$ combined with the two possible 
signs of $T$ yield four possible translation-rotation pairs, from which the correct 
solution for the rigid motion can be chosen based on the requirement that the 
visible 3D points appear in front of both camera viewpoints [?]. 

[Figure 4 plots: uncertainty ellipses for e (left) and e' (right) when the noise standard deviation is 0.1, 1.0 and 2.0 pixels.] 

Fig. 4. The ellipses defined by Eq. (14) using k = 0.75 and the covariances of the two 
epipoles computed by the statistical and analytical method. The left column 
corresponds to e, the right to e'. The standard deviation of the image noise is 0.1 pixels 
for the first row, 1.0 and 2.0 pixels for the middle and bottom rows respectively. Both 
axes in all plots represent pixel coordinates while points in the plots marked with grey 
dots correspond to estimates of the epipoles obtained during the statistical estimation 
process. 

In the following, the covariance corresponding only to the first of the two solutions for 
rotation will be computed; the covariance of the second solution can be computed 
in a similar manner. 
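
The two rotation candidates can be sketched with NumPy as shown below. This is an illustrative sketch under stated assumptions rather than the method of the cited references: the determinant fix on the third singular vectors is a common practical step added here so that both candidates are proper rotations, and the function names are hypothetical. Because $E = [T]_\times R$ implies $T^T E = 0$, the sketch reads the translation direction from the left null vector of $E$ (the third column of $U$ in NumPy's output); under the transposed convention $E = R [T]_\times$ it would be the third column of $V$ instead.

```python
import numpy as np

W = np.array([[0.0, 1.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])

def skew(t):
    """Antisymmetric matrix [t]_x such that skew(t) @ x equals cross(t, x)."""
    return np.array([[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]])

def motion_candidates(E):
    """Two rotation candidates and the translation direction (up to sign)."""
    U, _, Vt = np.linalg.svd(np.asarray(E, dtype=float))
    # The third singular value is zero, so the third singular vectors can be
    # flipped freely; this makes det(U) = det(V) = +1 (assumed practical step).
    if np.linalg.det(U) < 0:
        U[:, 2] = -U[:, 2]
    if np.linalg.det(Vt) < 0:
        Vt[2] = -Vt[2]
    R1 = U @ W @ Vt
    R2 = U @ W.T @ Vt
    T = U[:, 2]          # left null vector of E for E = [T]_x R
    return R1, R2, T
```

One of the two returned matrices coincides with the rotation used to build $E$; choosing between them and fixing the sign of $T$ requires the visibility test mentioned in the text, which is not shown here.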

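Supposing that the problem of camera calibration has been solved, the 
essential matrix can be recovered from the fundamental matrix using 

$$E = A^T F A,$$

where $A$ is the intrinsic calibration parameters matrix. Using Eq. (10), the 
covariance of $E$ can be computed as 

$$\Lambda_E = \frac{\partial (A^T F A)}{\partial F}\, \Lambda_F\, \frac{\partial (A^T F A)}{\partial F}^T,$$

where $\frac{\partial (A^T F A)}{\partial F}$ is the derivative of $A^T F A$ at $F$ and $\Lambda_F$ is the covariance of $F$ 
[?]. The derivative of $A^T F A$ with respect to the element $f_{ij}$ of $F$ is equal to 

$$\frac{\partial (A^T F A)}{\partial f_{ij}} = A^T \frac{\partial F}{\partial f_{ij}} A.$$

Matrix $\frac{\partial F}{\partial f_{ij}}$ is such that all its elements are zero, except for that in row $i$ and 
column $j$, which is equal to one. Given the covariance of $E$, the covariance of $R$ 
is then computed from 

$$\Lambda_R = \frac{\partial (U W V^T)}{\partial E}\, \Lambda_E\, \frac{\partial (U W V^T)}{\partial E}^T.$$

The derivative of $U W V^T$ with respect to the element $e_{ij}$ of $E$ is given by 

$$\frac{\partial (U W V^T)}{\partial e_{ij}} = \frac{\partial U}{\partial e_{ij}} W V^T + U W \frac{\partial V^T}{\partial e_{ij}}.$$

The derivatives $\frac{\partial U}{\partial e_{ij}}$ and $\frac{\partial V^T}{\partial e_{ij}}$ in the above expression are computed with the 
aid of the proposed differentiation method. 

Regarding the covariance of $T$, let $V_3$ denote the vector corresponding to 
the third column of $V$. The covariance of translation is then simply 

$$\Lambda_T = \frac{\partial V_3}{\partial E}\, \Lambda_E\, \frac{\partial V_3}{\partial E}^T,$$

with $\frac{\partial V_3}{\partial E}$ being again computed using the proposed method. 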
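
The first propagation step of this section, from the covariance of $F$ to that of $E = A^T F A$, can be written compactly with a Kronecker product. The sketch below assumes that $F$ and $E$ are vectorized row by row into 9-vectors, so that the covariances are $9 \times 9$; this layout and the function name `essential_covariance` are assumptions made for illustration.

```python
import numpy as np

def essential_covariance(A, cov_F):
    """Covariance of vec(E) for E = A^T F A, given the covariance of vec(F).

    Row-major vectorization is assumed, for which
    vec(A^T F A) = kron(A^T, A^T) @ vec(F).
    """
    A = np.asarray(A, dtype=float)
    J = np.kron(A.T, A.T)              # 9 x 9 Jacobian d vec(E) / d vec(F)
    return J @ np.asarray(cov_F, dtype=float) @ J.T, J
```

Column $3i + j$ of `J` (with zero-based indices) is the row-major flattening of $A^T \frac{\partial F}{\partial f_{ij}} A$, which matches the element-wise derivative given in the text.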



4 Conclusions 

The Singular Value Decomposition is a linear algebra technique that has been 
successfully applied to a wide variety of domains that involve matrix computa- 
tions. In this paper, a novel technique for computing the Jacobian of the SVD 
components of a matrix with respect to the matrix itself has been described. 
The usefulness of the proposed technique has been demonstrated by applying it 
to the estimation of the uncertainty in three different practical vision problems, 
namely self-calibration, epipole estimation and rigid 3D motion estimation. 
Abstract. This paper focuses on matching 1D structures by variational 
methods. We provide rigorous rules for the construction of the cost func- 
tion, on the basis of an analysis of properties which should be satisfied by 
the optimal matching. A new, exact, dynamic programming algorithm 
is then designed for the minimization. We conclude with experimental 
results on shape comparison. 



1 Introduction 

In signal processing, or image analysis, situations arise when objects of 
interest are functions $\theta$ defined on an interval $I \subset \mathbb{R}$, and taking values in $\mathbb{R}^d$. 
The interval $I$ may be a time interval (with applications to speech recognition, 
or to on-line handwritten character recognition), a depth interval (for example 
to analyze 1D geological data), or arc-length (with direct application to shape 
recognition, 2D or 3D curve identification and comparison, etc.). 

Comparing these “functional objects” is an important issue, for identification 
or for retrieval in a database. Most of the time, the problem is intricately coupled 
with the issue of matching the functions. The matching problem can be described 
as “finding similar structures appearing at similar places (or similar times)”: 
given two “objects”, $\theta$ and $\theta'$, expressed as functions defined on the same interval 
$I$, the issue is to find, for each $x \in I$, some $x' \in I$ such that $x \simeq x'$ and 
$\theta(x) \simeq \theta'(x')$. If every point in one curve is uniquely matched to some point in 
the other curve, the matching is a bijection $\phi : I \to I$, and the problem can 
be formulated as finding such a $\phi$ such that $\phi \simeq \mathrm{id}$ (where id is the identity 
function $x \mapsto x$) and $\theta \simeq \theta' \circ \phi$. 

Since both constraints may drag the solution in opposite directions, a common
approach is to balance them by minimizing some functional
$L_{\theta,\theta'}(\phi)$. One simple example is (letting
$\dot\phi = d\phi/dx$)

$$L_{\theta,\theta'}(\phi) = \int_I \dot\phi^2\,dx + \mu \int_I (\theta(x) - \theta' \circ \phi(x))^2\,dx . \qquad (1)$$

Many functionals which are used in the literature fall into this category, with
some variations (see, for example, [2]), in spite of the fact that this
formulation has the drawback of not being symmetrical with respect to $\theta$
and $\theta'$ (matching $\theta$ to $\theta'$ or $\theta'$ to $\theta$ are
distinct operations).

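As an example of symmetric matching functional, let us quote [[Q, with
(among other proposals)

$$L_{\theta,\theta'}(\phi) = \int_I |\dot\phi(x) - 1|\,dx + \mu \int_I |\theta(x) - \dot\phi(x)\,\theta' \circ \phi(x)|\,dx \qquad (2)$$

or pni) with

$$L_{\theta,\theta'}(\phi) = \mathrm{length}(I) - \int_I \sqrt{\dot\phi(x)}\,\left|\cos\frac{\theta(x) - \theta' \circ \phi(x)}{2}\right| dx . \qquad (3)$$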



The last two examples provide, after minimization over $\phi$, a distance
$d(\theta,\theta')$ between the functions $\theta$ and $\theta'$.

In this paper, all the matching functionals are associated to a function $F$
defined on $]0,+\infty[ \times \mathbb{R}^d \times \mathbb{R}^d$, letting

$$L_{\theta,\theta'}(\phi) = \int_I F(\dot\phi(x), \theta(x), \theta' \circ \phi(x))\,dx$$

and the optimal matching corresponds to a minimum of $L$. To fix the ideas, we
also let $I = [0, 1]$.

Our first goal is to list some essential properties which must be satisfied 
by the matching functionals, and see how these properties can constrain their 
design. 



2 Designing Matching Functionals 

Let a function $F$ be defined on $]0,+\infty[ \times \mathbb{R}^d \times
\mathbb{R}^d$. We specify the problem of optimal matching between two functions
$\theta$ and $\theta'$, defined on $[0, 1]$, and with values in $\mathbb{R}^d$,
as the search of the minimum, among all increasing diffeomorphisms $\phi$ of
$[0, 1]$, of the functional

$$L_{\theta,\theta'}(\phi) = \int_0^1 F(\dot\phi(x), \theta(x), \theta' \circ \phi(x))\,dx .$$

Note that the functional $L_{\theta,\theta'}$ is only altered by the addition of
a constant $\lambda + \mu$ if $F$ is replaced by $F + \lambda\xi + \mu$ for some
real numbers $\lambda$ and $\mu$ (indeed $\int_0^1 \dot\phi\,dx = 1$). Since this
does not affect the variational problem, all the conditions which are given
below on $F$ are implicitly assumed to be true up to such a transform.
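
The invariance under the transform $F \mapsto F + \lambda\xi + \mu$ can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a piecewise-linear $\phi$ given by its values on a uniform grid and a midpoint quadrature rule. The names `matching_cost` and `phi_nodes`, the test signals and all numerical constants are hypothetical choices made for illustration.

```python
import math

def matching_cost(F, theta, theta_p, phi_nodes):
    """Midpoint-rule value of L(phi) = int_0^1 F(phi', theta, theta' o phi) dx.

    phi_nodes[k] is phi(k/n); phi is taken linear on each cell (assumption).
    """
    n = len(phi_nodes) - 1
    h = 1.0 / n
    total = 0.0
    for k in range(n):
        slope = (phi_nodes[k + 1] - phi_nodes[k]) / h      # phi' on the cell
        x_mid = (k + 0.5) * h                              # midpoint of the cell
        y_mid = 0.5 * (phi_nodes[k] + phi_nodes[k + 1])    # phi at that midpoint
        total += h * F(slope, theta(x_mid), theta_p(y_mid))
    return total

# Hypothetical test signals and an increasing phi with phi(0) = 0, phi(1) = 1.
theta = lambda x: math.sin(2 * math.pi * x)
theta_p = lambda x: math.sin(2 * math.pi * x * x)
n = 200
phi_nodes = [(k / n) ** 1.5 for k in range(n + 1)]

weight = 3.0                      # plays the role of the weight mu in example (1)
F = lambda xi, u, v: xi ** 2 + weight * (u - v) ** 2

lam, mu = 0.7, -1.3               # constants of the transform F -> F + lam*xi + mu
F_shift = lambda xi, u, v: F(xi, u, v) + lam * xi + mu

base = matching_cost(F, theta, theta_p, phi_nodes)
shifted = matching_cost(F_shift, theta, theta_p, phi_nodes)
print(shifted - base, lam + mu)   # the two numbers should agree up to rounding
```

Because the slopes of a piecewise-linear $\phi$ telescope to $\phi(1) - \phi(0) = 1$, the difference is independent of the particular $\phi$, which is why the minimizer is unaffected.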

2.1 Convexity

The first property we introduce can be seen as technical but is nonetheless
essential for the variational problem. It states that $F$ must be a convex
function of $\dot\phi$. From a theoretical point of view, this is almost a
minimal condition for the well-posedness of the optimization. It is indeed
proved in ^ that this condition is equivalent to the fact that the functional
$L_{\theta,\theta'}$ is lower semi-continuous as a function of $\phi$ (in
suitable functional spaces): lower semi-continuity must indeed be considered as
a weak constraint for minimization. Of course, this assumption does not imply
that $L_{\theta,\theta'}$ is convex in $\phi$. We state the convexity condition
for future reference:

[Convex] for all $u, v \in \mathbb{R}^d$, $\xi \mapsto F(\xi, u, v)$ is convex
on $]0,+\infty[$.

We shall see later that this assumption also has interesting numerical
consequences, in particular when the functions $\theta$ and $\theta'$ are
piecewise constant.



2.2 Symmetry 



The next property we introduce is symmetry. In most applications, there is no
reason to privilege one object over the other, which implies that the optimal
matching should not depend upon the order in which the functions $\theta$ and
$\theta'$ are considered. We thus aim at the property that, for any functions
$\theta$ and $\theta'$,

$$\phi = \mathrm{argmin}\, L_{\theta,\theta'} \iff \phi^{-1} = \mathrm{argmin}\, L_{\theta',\theta} .$$

Since

$$L_{\theta',\theta}(\phi^{-1}) = \int_0^1 \dot\phi(x)\, F\!\left(\frac{1}{\dot\phi(x)}, \theta' \circ \phi(x), \theta(x)\right) dx ,$$

a sufficient condition for symmetry is

[Symmetry] For all $(\xi, u, v) \in\, ]0,+\infty[ \times \mathbb{R}^d \times
\mathbb{R}^d$, one has $F(\xi, u, v) = \xi F(1/\xi, v, u)$.



It is very important to check that this condition is compatible with the first 
one. This fact is a consequence of the next lemma 



Lemma 1. A mapping $f : ]0,+\infty[ \to \mathbb{R}$ is convex if and only if
$f^* : \xi \mapsto \xi f(1/\xi)$ is convex.

Proof. We know that a function is convex if and only if it can be expressed as
the supremum of some family of affine functions: $f(x) = \sup_i \{f_i(x)\}$
where each $f_i$ is affine, $f_i(x) = \alpha_i x + \beta_i$. Then, for all
$x > 0$, $f^*(x) := x f(1/x) = \sup_i \{\alpha_i + \beta_i x\}$, which proves
that $f^*$ is convex. The converse follows since $(f^*)^* = f$.

In the general case, we let $F^*(\xi, u, v) = \xi F(1/\xi, v, u)$, so that the
symmetry condition becomes $F = F^*$. We let $F^s$ be the symmetrized version
of $F$, $F^s = F + F^*$. Lemma 1 implies that $F^s$ satisfies [Convex] as soon
as $F$ satisfies it. Returning to example (1), for which

$$F(\xi, u, v) = \xi^2 + \mu (u - v)^2 ,$$

we have

$$F^s(\xi, u, v) = \xi^2 + \frac{1}{\xi} + \mu (1 + \xi)(u - v)^2$$

and the symmetrized matching functional is

$$L_{\theta,\theta'}(\phi) = \int_0^1 \left(\dot\phi(x)^2 + \frac{1}{\dot\phi(x)}\right) dx + \mu \int_0^1 (1 + \dot\phi(x))(\theta(x) - \theta' \circ \phi(x))^2\,dx .$$
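
The symmetrization $F \mapsto F + F^*$ is mechanical and easy to check on the quadratic example. The sketch below is only an illustration under stated assumptions: the helper names `star` and `symmetrize`, the weight value and the sample points are hypothetical, and scalar values of $u$, $v$ are used.

```python
def star(F):
    """Return F*(xi, u, v) = xi * F(1/xi, v, u)."""
    return lambda xi, u, v: xi * F(1.0 / xi, v, u)

def symmetrize(F):
    """Return the symmetrized integrand F + F*."""
    F_star = star(F)
    return lambda xi, u, v: F(xi, u, v) + F_star(xi, u, v)

mu = 2.0  # hypothetical weight
F = lambda xi, u, v: xi ** 2 + mu * (u - v) ** 2
F_sym = symmetrize(F)

# Closed form obtained by hand for this example.
closed_form = lambda xi, u, v: xi ** 2 + 1.0 / xi + mu * (1.0 + xi) * (u - v) ** 2

samples = [(0.3, 1.0, -0.5), (1.0, 0.2, 0.2), (2.5, -1.0, 0.4)]
for xi, u, v in samples:
    # The symmetrized integrand matches the closed form ...
    assert abs(F_sym(xi, u, v) - closed_form(xi, u, v)) < 1e-9
    # ... and is left unchanged by the star operation, i.e. it satisfies [Symmetry].
    assert abs(F_sym(xi, u, v) - star(F_sym)(xi, u, v)) < 1e-9
```

The second assertion is the [Symmetry] condition itself, evaluated at a few sample points rather than proved.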



2.3 Consistent Self-Matching 

Another natural condition is that, when comparing a function $\theta$ with
itself, the optimal $\phi$ should be $\phi = \mathrm{id}$. In other terms, one
should have, for all functions $\theta$, and all diffeomorphisms $\phi$,

$$\int_0^1 F(\dot\phi(x), \theta(x), \theta \circ \phi(x))\,dx \ge \int_0^1 F(1, \theta(x), \theta(x))\,dx . \qquad (4)$$

Making the change of variable $y = \phi(x)$ in the first integral and letting
$\psi = \phi^{-1}$, (4) yields, for any diffeomorphism $\psi$, and for all
$\theta$,

$$\int_0^1 F^*(\dot\psi(x), \theta(x), \theta \circ \psi(x))\,dx \ge \int_0^1 F(1, \theta(x), \theta(x))\,dx$$

so that, if (4) is true for $F$, it is also true for $F^*$ and thus for $F^s$
(one has $F^*(1, u, u) = F(1, u, u)$ for all $u$). This shows that our
conditions are compatible.



We use a more convenient, almost equivalent, form of (4):

[Self-matching] There exists a measurable function $\lambda : \mathbb{R}^d \to
\mathbb{R}$ such that, for all $\xi > 0$, $u, v \in \mathbb{R}^d$,

$$F(\xi, u, v) \ge F(1, u, u) + \lambda(v)\xi - \lambda(u) .$$



We have 

Proposition 1. If $F$ satisfies [Self-matching], then inequality (4) is true
for all $\phi$, $\theta$.

Conversely, if inequality (4) is true for all $\phi$ and $\theta$, and if $F$
is differentiable with respect to its first variable at $\xi = 1$, then
[Self-matching] is true.

The first assertion is true by the sequence of inequalities:

$$\begin{aligned}
\int_0^1 F(\dot\phi(x), \theta(x), \theta \circ \phi(x))\,dx
 &\ge \int_0^1 F(1, \theta(x), \theta(x))\,dx + \int_0^1 \dot\phi(x)\lambda(\theta \circ \phi(x))\,dx - \int_0^1 \lambda(\theta(x))\,dx \\
 &= \int_0^1 F(1, \theta(x), \theta(x))\,dx
\end{aligned}$$

since $\int_0^1 \dot\phi(x)\lambda(\theta \circ \phi(x))\,dx = \int_0^1
\lambda(\theta(x))\,dx$ by change of variables. The proof of the converse is
given in the appendix.

2.4 Focus Invariance



Additional constraints may come from invariance properties which are imposed
on the matching. Whereas [Convex], [Symmetry] and [Self-matching] have some
kind of universal validity, the next ones have to be, to some extent,
application dependent.

The invariance property we consider in this section will be called “focus
invariance”. It states that the matching remains stable when the problem is
refocused on a sub-interval of $[0, 1]$.

To describe this, consider $\theta$ and $\theta'$ as signals, defined on
$[0, 1]$, and assume that they have been matched by some function $\phi^*$. Let
$[a, b]$ be a sub-interval of $[0, 1]$ and set $[a', b'] = [\phi^*(a),
\phi^*(b)]$. To refocus the matching on these intervals, rescale the functions
$\theta$ and $\theta'$ to get new signals defined on $[0, 1]$, which can be
matched with the same procedure. Focus invariance states that this new
matching is the same as the one which has been obtained initially.

Let us be more precise. To rescale $\theta$ (resp. $\theta'$), define
$\theta_{a,b}(x) = \theta(a + (b - a)x)$ (resp. $\theta'_{a',b'}(x) =
\theta'(a' + (b' - a')x)$), $x \in [0, 1]$. Comparing these signals with the
functional $F$ yields an optimal matching which, if it exists, minimizes

$$\int_0^1 F(\dot\phi(x), \theta_{a,b}(x), \theta'_{a',b'} \circ \phi(x))\,dx . \qquad (5)$$

The original optimal matching between the functions $\theta$ and $\theta'$
clearly minimizes

$$\int_a^b F(\dot\phi(y), \theta(y), \theta' \circ \phi(y))\,dy$$

with the constraints $\phi(a) = a'$ and $\phi(b) = b'$. Making the change of
variables $y = a + (b - a)x$, setting $\psi(x) = (\phi(y) - a')/(b' - a')$,
this integral can be written

$$(b - a) \int_0^1 F(\lambda\dot\psi(x), \theta_{a,b}(x), \theta'_{a',b'} \circ \psi(x))\,dx \qquad (6)$$

with $\lambda = (b' - a')/(b - a)$. We say that $F$ satisfies a focus
invariance property if, for any $\theta$ and $\theta'$, the minimizer of (5) is
the same as the minimizer of (6).

One possible condition ensuring such a property is that $F$ is itself
(relatively) invariant under the transformation $(\xi, u, v) \mapsto
(\lambda\xi, u, v)$, that is, for some $\alpha > 0$, for all $\xi > 0$,
$u, v \in \mathbb{R}^d$,

$$F(\lambda\xi, u, v) = \lambda^\alpha F(\xi, u, v)$$

or $F(\xi, u, v) = \xi^\alpha F(1, u, v)$. We state this condition

[Focus] For some $\alpha > 0$, $F$ takes the form, for some function $F_1$
defined on $\mathbb{R}^d \times \mathbb{R}^d$, $F(\xi, u, v) = -\xi^\alpha
F_1(u, v)$.

For such a function, [Convex] is true if and only if, either $\alpha = 1$, or
$\alpha \in\, ]0, 1[$ and $F_1 \ge 0$, or $\alpha \in\, ]-\infty, 0[\, \cup\,
]1, +\infty[$ and $F_1 \le 0$. To ensure [Symmetry], one needs $\alpha = 1/2$
and $F_1$ symmetrical. We thus get that

Proposition 2. The only matching functionals which satisfy [Symmetry] and
[Focus] take the form

$$F(\xi, u, v) = -\sqrt{\xi}\, F_1(u, v) \qquad (7)$$

with $F_1(u, v) = F_1(v, u)$.

Such a function $F$ satisfies [Convex] if and only if $F_1(u, v) \ge 0$ for all
$u$ and $v$.

It satisfies [Self-matching] if, for all $u, v \in \mathbb{R}^d$,

$$F_1(u, v) \le \sqrt{F_1(u, u) F_1(v, v)} . \qquad (8)$$

Proof. It remains to prove the last assertion. For [Self-matching], we must
have, for some function $\lambda$,

$$-\lambda(u) - F_1(u, u) = \min_{\xi, v} \left(-\lambda(v)\xi - \sqrt{\xi}\, F_1(u, v)\right) .$$

For a fixed $v$, $-\lambda(v)\xi - \sqrt{\xi}\, F_1(u, v)$ has a finite minimum
in two cases: first, if $\lambda(v) < 0$, and this minimum is given by
$F_1(u, v)^2/(4\lambda(v))$, and second, if $\lambda(v) = F_1(u, v) = 0$. In
the first case, we have

$$-\lambda(u) - F_1(u, u) = \min_{v,\ \lambda(v) < 0} \frac{F_1(u, v)^2}{4\lambda(v)} . \qquad (9)$$

In particular, taking $v = u$, one has, if $\lambda(u) < 0$,

$$F_1(u, u)^2 + 4\lambda(u) F_1(u, u) + 4(\lambda(u))^2 \le 0$$

which is possible only if $F_1(u, u) = -2\lambda(u)$. Given this fact, which is
true also if $\lambda(u) = 0$, (9) clearly implies (8).
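
As a sanity check of Proposition 2, one can test inequality (8) and the [Self-matching] inequality numerically for a concrete symmetric $F_1$. The sketch below assumes scalar angles and the choice $F_1(u, v) = |\cos((u - v)/2)|$, which corresponds to functional (3) up to an additive constant; the multiplier $\lambda(u) = -F_1(u, u)/2$ is the one suggested by the proof. All function names and the sampling grids are hypothetical.

```python
import math

def F1(u, v):
    """Symmetric kernel associated with functional (3), for scalar angles."""
    return abs(math.cos((u - v) / 2.0))

def F(xi, u, v):
    """Focus-invariant symmetric integrand of the form (7)."""
    return -math.sqrt(xi) * F1(u, v)

def lam(u):
    """Candidate multiplier from the proof: F1(u, u) = -2 * lambda(u)."""
    return -F1(u, u) / 2.0

def self_matching_gap(xi, u, v):
    """F(xi,u,v) - [F(1,u,u) + lambda(v)*xi - lambda(u)], expected >= 0."""
    return F(xi, u, v) - (F(1.0, u, u) + lam(v) * xi - lam(u))

grid_xi = [0.05 * k for k in range(1, 81)]     # slopes in ]0, 4]
angles = [0.3 * k for k in range(0, 21)]       # sample angles in [0, 6]

worst_gap = min(self_matching_gap(xi, u, v)
                for xi in grid_xi for u in angles for v in angles)
worst_cs = max(F1(u, v) - math.sqrt(F1(u, u) * F1(v, v))
               for u in angles for v in angles)
print(worst_gap, worst_cs)  # expected: worst_gap >= 0 and worst_cs <= 0, up to rounding
```

For this kernel the gap reduces to $(1 + \xi)/2 - \sqrt{\xi}\,F_1(u, v)$, which is non-negative because $F_1 \le 1$.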



2.5 Scale Invariance for Shape Comparison 

Focus invariance under the above form is not a suitable constraint for every
matching problem. Let us restrict to the comparison of plane curves, which
initially motivated this paper. In this case, the functions $\theta$ typically
are geometrical features computed along the curve, expressed as functions of
the arc-length. In such a context, focusing should rather be interpreted from
a geometrical point of view, as rescaling (a portion of) a plane curve so that
it has, let's say, length 1. But applying such a scale change may have some
impact not only on the variable $x$ (which here represents the length), but
also on the values of the geometric features $\theta$. In H31, for example, the
geometric features were the orientations of the tangents, which are not
affected by scale change, so that focus invariance is in this case equivalent
to geometric scale invariance. Letting $\kappa$ be the curvature computed along
the curve, the same invariance would be true if we had taken $\theta =
\kappa'/\kappa^2$ (which is the “curvature” which characterizes curves up to
similitudes). But if we had chosen to compare precisely Euclidean curvatures,
the invariance constraints on the matching would be different: since curvatures
are scaled by $\lambda^{-1}$ when a curve is scaled by $\lambda$, the correct
condition should be (instead of [Focus]):

$$F(\lambda\xi, \lambda u, v) = \lambda^\alpha F(\xi, u, v)$$

This comes from rescaling only the first curve. Rescaling the second curve
yields

$$F(\lambda\xi, u, v/\lambda) = \lambda^\beta F(\xi, u, v)$$

Note that, if the symmetry condition is valid, we must have $\beta = 1 -
\alpha$, which we assume hereafter.

One can solve this identity, and compute all the (continuously differentiable)
functions which satisfy it. This yields functions $F$ of the kind

$$F(\xi, u, v) = H\!\left(\frac{\xi v}{u}\right) u^\alpha v^{\alpha - 1} .$$

Note that, since $F$ should be convex as a function of $\xi$, $H$ itself should
be convex. The symmetry condition is ensured as soon as $x H(1/x) = H(x)$ for
all $x$. One choice can be

$$F(\xi, u, v) = |\xi v - u| ,$$

which satisfies [Convex], [Symmetry] and [Self-matching].
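
The claimed properties of the choice $F(\xi, u, v) = |\xi v - u|$ can be spot-checked on scalar curvature values. The following sketch is illustrative only: the sample points, the scale factor and the multiplier $\lambda(u) = u$ used for the [Self-matching] check are assumptions made here, not taken from the paper.

```python
def F(xi, u, v):
    """Scale-invariant integrand |xi * v - u| for scalar curvatures u, v."""
    return abs(xi * v - u)

def lam(u):
    """One admissible multiplier for [Self-matching] (assumption): lambda(u) = u."""
    return u

scale = 3.0
samples = [(0.5, 1.0, 2.0), (2.0, 0.3, 0.7), (1.0, 1.5, 1.5), (0.25, 2.0, 0.1)]
for xi, u, v in samples:
    # Rescaling the first curve: exponent alpha = 1.
    assert abs(F(scale * xi, scale * u, v) - scale * F(xi, u, v)) < 1e-12
    # Rescaling the second curve: exponent beta = 1 - alpha = 0.
    assert abs(F(scale * xi, u, v / scale) - F(xi, u, v)) < 1e-12
    # [Symmetry]: F(xi, u, v) = xi * F(1/xi, v, u).
    assert abs(F(xi, u, v) - xi * F(1.0 / xi, v, u)) < 1e-12
    # [Self-matching] with lambda(u) = u, since |xi*v - u| >= xi*v - u.
    assert F(xi, u, v) >= F(1.0, u, u) + lam(v) * xi - lam(u) - 1e-12
```

Here $H(x) = |x - 1|$ and $\alpha = 1$, so that $x H(1/x) = H(x)$ holds for all $x > 0$.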

Many variations can be made on these computations. The first chapters of Pj
contain information on how to devise functionals which satisfy given criteria
of invariance.

2.6 Remark 

A similar, “axiomatic” approach has been taken in in which a set of con- 
straints has been proposed in the particular case of matching curvatures for 
shape comparison. They have introduced a series of conditions, in this context 
(which turned out, however, to be incompatible). The only common condition 
with our paper is the symmetry, since we have chosen not to discuss the trian- 
gular inequality. Note, also, that scale invariance is not taken into account in 

P 

3 Existence Results 

One essential issue, for the matching problem, is to know in advance that the
associated variational problem has a solution. It is also interesting to be
able to analyze a priori some properties of the optimal matching. These
problems have been addressed in H2|, in the particular case of focus invariant
symmetric matching, that is, for $F$ of the kind $F(\xi, u, v) = -\sqrt{\xi}\,
F_1(u, v)$ (the objective was in particular to be able to deal with functionals
like (3)). However, we believe that, without too much effort, the results can
be extended to a wider range of cases.

In general, it is (relatively) easy to prove that the variational problem has a
solution in a larger space than only the diffeomorphisms of $[0, 1]$. In (see
also Id)) we have extended the functional to the set of all probability
measures on $[0, 1]$, replacing $\dot\phi$ by the Radon-Nikodym derivative with
respect to Lebesgue's measure. Using a direct method (cf. 0), a minimizer of
the extended functional could be shown to exist. The hardest part of the study
is then to give conditions under which this minimizer yields a correct
matching, in the sense that it provides at least a homeomorphism of $[0, 1]$.
We now state the results, in the case when $F(\xi, u, v) = -\sqrt{\xi}\,
F_1(u, v)$. Fixing $\theta$ and $\theta'$, we let $f(x, y) = F_1(\theta(x),
\theta'(y))$, and

$$U_f(\phi) = -\int_0^1 \sqrt{\dot\phi(x)}\, f(x, \phi(x))\,dx$$

where $\dot\phi$ should be understood as a Radon-Nikodym derivative of the
measure defined by $\mu([0, x[) = \phi(x)$. To simplify, we assume that

Notation For $a, b \in [0, 1]^2$, denote by $[a, b]$ the closed segment
$\{a + t(b - a),\ 0 \le t \le 1\}$, and by $]a, b[$ the open segment
$[a, b] \setminus \{a, b\}$. A segment is horizontal (respectively vertical) if
$a_2 = b_2$ (respectively $a_1 = b_1$), where $a = (a_1, a_2)$ and
$b = (b_1, b_2)$.

Notation We let $\Delta_f = \int_0^1 f(x, x)\,dx$, and $\Omega_f$ be the set



% = I (a;, 2 /) G [0,1]^ : \x - y\ < 

We have 

Theorem 1. Assume that $f \ge 0$ is bounded, upper semi-continuous, and

- there exists a finite family of closed segments $([a_j, b_j])_{j \in J}$ such
  that each of them is horizontal or vertical and $f$ is continuous on
  $[0, 1]^2 \setminus F$, where $F$ denotes the union of these segments;
- there does not exist any non-empty open vertical or horizontal segment
  $]a, b[$ such that $]a, b[\, \subset \Omega_f$ and $f$ vanishes on $]a, b[$.

Then there exists $\phi^* \in \mathrm{Hom}^+$ such that $U_f(\phi^*) =
\min\{U_f(\phi),\ \phi \in \mathrm{Hom}^+\}$. Moreover, if $\phi$ is a
minimizer of $U_f$, one has, for all $x \in [0, 1]$, $(x, \phi(x)) \in
\Omega_f$.

We have denoted by $\mathrm{Hom}^+$ the set of (strictly) increasing
homeomorphisms on $[0, 1]$. We now pass to conditions under which the optimal
matching satisfies some smoothness properties.

Definition 1. We say that $f : [0, 1]^2 \to \mathbb{R}$ is Hölder continuous at
$(y, x)$ if there exist $\alpha > 0$ and $C > 0$ such that

$$|f(y', x') - f(y, x)| \le C \max(|y' - y|^\alpha, |x' - x|^\alpha) \qquad (10)$$

for any {y',x') G [0,1]^ such that {y',x') A iv^x). 

We say that $f$ is locally uniformly Hölder continuous at $(y_0, x_0)$ if there
exists a neighborhood $V$ of $(y_0, x_0)$ such that $f$ is Hölder continuous at
all $(y, x) \in V$, with constants $C$ and $\alpha$ which are uniform over $V$.

Theorem 2. Let $f$ be a non-negative real-valued measurable function on
$[0, 1]^2$ and assume that $U_f$ reaches its minimal value on $\mathrm{Hom}^+$
at $\phi^*$. Then for any $x_0 \in [0, 1]$, if $f(x_0, \phi^*(x_0)) > 0$ and if
$f$ is Hölder continuous at $(x_0, \phi^*(x_0))$, then $\phi^*$ is
differentiable at $x_0$ with strictly positive derivative.

Moreover, if $f$ is locally uniformly Hölder continuous, then $\dot\phi^*$ is
continuous in a neighborhood of $x_0$.

Theorem 3. Assume that $f$ is continuously differentiable in both variables.
Let $\phi \in \mathrm{Hom}^+$ be such that $U_f(\phi) = \min\{U_f(\psi) \mid
\psi \in \mathrm{Hom}^+\}$ and that, for all $x \in [0, 1]$, one has
$f(x, \phi(x)) > 0$. Then, $\phi$ is twice continuously differentiable.

4 Numerical Study 

We now study the numerical problem of minimizing a functional of the kind

$$U(\phi) = \int_0^1 F(\dot\phi(x), \theta(x), \theta' \circ \phi(x))\,dx$$

in $\phi$, for two functions $\theta$ and $\theta'$ defined on $[0, 1]$, with
values in $\mathbb{R}^d$; $\phi$ is an increasing diffeomorphism of $[0, 1]$.

We assume [Convex], so that $F$ is convex in its first variable. This condition
will allow us to devise a dynamic programming algorithm to compute exactly
the optimal matching in the case of discretized functions $\theta$ and
$\theta'$.

We recall the notation, for all $\xi > 0$, $u, v \in \mathbb{R}^d$:
$F^*(\xi, u, v) = \xi F(1/\xi, v, u)$. We start with a very simple lemma, which
is implied by the first assumption:



Lemma 2. Assume [Convex]. Let $0 \le a < b \le 1$, and $0 \le a' < b' \le 1$.
Fix $u, v \in \mathbb{R}^d$. Then the minimum in $\phi$ of

$$\int_a^b F(\dot\phi(x), u, v)\,dx$$

with constraints $\phi(a) = a'$ and $\phi(b) = b'$, is attained for $\phi$
linear: $\phi(x) = a' + (x - a)(b' - a')/(b - a)$.

Moreover, $F$ is convex in $\xi$ if and only if $F^*$ is convex in $\xi$.

Proof. For any convex function $G$, and for any $\phi$ such that $\phi(a) = a'$
and $\phi(b) = b'$, the fact that

$$\int_a^b G(\dot\phi(x))\,dx \ge (b - a)\, G\!\left(\frac{b' - a'}{b - a}\right)$$

is a consequence of Jensen's inequality. The second assertion is lemma 1.
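
The Jensen argument can be illustrated by comparing the linear interpolant with two-piece alternatives sharing the same endpoints. This is a small sketch under stated assumptions: the convex function `G`, the endpoints and the helper names are hypothetical, and only piecewise-linear competitors with one interior breakpoint are tried.

```python
def linear_cost(G, a, b, a_p, b_p):
    """Cost (b - a) * G(slope) of the linear phi with phi(a) = a_p, phi(b) = b_p."""
    return (b - a) * G((b_p - a_p) / (b - a))

def two_piece_cost(G, a, b, a_p, b_p, c, c_p):
    """Cost of the phi that is linear on [a, c] and on [c, b], with phi(c) = c_p."""
    return (c - a) * G((c_p - a_p) / (c - a)) + (b - c) * G((b_p - c_p) / (b - c))

G = lambda xi: xi ** 2 + 1.0 / xi     # a convex function on ]0, +inf[
a, b, a_p, b_p = 0.2, 0.8, 0.1, 0.9

best = linear_cost(G, a, b, a_p, b_p)
for c_p in [0.2, 0.4, 0.5, 0.6, 0.8]:
    # Any increasing two-piece competitor costs at least as much as the linear map.
    assert two_piece_cost(G, a, b, a_p, b_p, 0.5, c_p) >= best - 1e-12
```

The competitor with `c_p = 0.5` coincides with the linear map, so the inequality is attained there.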
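
Now, we assume that $\theta$ and $\theta'$ are piecewise constant. This means
that there exist subdivisions of $[0, 1]$, $0 = s_0 < s_1 < \cdots < s_{m-1} <
s_m = 1$ and $0 = s'_0 < s'_1 < \cdots < s'_{n-1} < s'_n = 1$, and some
constant values $\theta_1, \ldots, \theta_m, \theta'_1, \ldots, \theta'_n$ in
$\mathbb{R}^d$ such that $\theta(x) = \theta_i$ on $[s_{i-1}, s_i[$ and
$\theta'(x) = \theta'_j$ on $[s'_{j-1}, s'_j[$.

We now get a new expression for $U(\phi)$ in this situation. For this, we let
$\psi = \phi^{-1}$. We also denote $\tau_i = \phi(s_i)$ and $\tau'_j =
\psi(s'_j)$. Writing $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$, we
have

$$\begin{aligned}
U(\phi) &= \sum_{i=1}^m \int_{s_{i-1}}^{s_i} F(\dot\phi(x), \theta_i, \theta' \circ \phi(x))\,dx
         = \sum_{i=1}^m \int_{\tau_{i-1}}^{\tau_i} \dot\psi(x')\, F(1/\dot\psi(x'), \theta_i, \theta'(x'))\,dx' \\
        &= \sum_{i=1}^m \int_{\tau_{i-1}}^{\tau_i} F^*(\dot\psi(x'), \theta'(x'), \theta_i)\,dx'
         = \sum_{i=1}^m \sum_{j=1}^n \int_{\tau_{i-1} \vee s'_{j-1}}^{\tau_i \wedge s'_j} F^*(\dot\psi(x'), \theta'_j, \theta_i)\,dx'
\end{aligned}$$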



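Thus, by lemma 2, the minimizer of $U$ can be searched over all piecewise
linear $\phi$. Moreover, $\psi$ has to be linear on every interval of the kind
$]\tau_{i-1} \vee s'_{j-1},\ \tau_i \wedge s'_j[$. For such a $\phi$, we have

$$U(\phi) = \sum_{i=1}^m \sum_{j=1}^n (\tau_i \wedge s'_j - \tau_{i-1} \vee s'_{j-1})\, F^*\!\left(\frac{\tau'_j \wedge s_i - \tau'_{j-1} \vee s_{i-1}}{\tau_i \wedge s'_j - \tau_{i-1} \vee s'_{j-1}},\ \theta'_j,\ \theta_i\right)$$

where the sum runs over the pairs $(i, j)$ for which the interval is non-empty.
So that $U$ is only a function of $\tau := (\tau_1, \ldots, \tau_{m-1})$ and
$\tau' := (\tau'_1, \ldots, \tau'_{n-1})$, and the numerical procedure has to
compute their optimal values. With a slight abuse of notation, we write
$U(\phi) = U(\tau, \tau')$.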

The function $U(\tau, \tau')$ can be minimized by dynamic programming. Let us
give some details about the procedure. To have some idea of the kind of
functions $\phi$ which are searched for, place, on the unit square $[0, 1]^2$,
the grid $G$ which contains all the points $m = (s, s')$ such that either
$s = s_i$ for some $i$, or $s' = s'_j$ for some $j$. We are looking for
continuous, increasing mappings $\phi$ which are linear on every portion which
does not meet $G$ (see figure 1).

On the set $G$, let $H_{ij}$ be the horizontal segment $s' = s'_j$,
$s_{i-1} < s < s_i$. Similarly, let $V_{ij}$ be the vertical segment $s = s_i$,
$s'_{j-1} < s' < s'_j$. Let $G_{ij} = H_{ij} \cup V_{ij}$. If $M \in G$, denote
by $i_M, j_M$ the pair $i, j$ such that $M \in G_{ij}$.

If $M = (s, s')$ and $P = (t, t')$ in $G$, write $M < P$ if $s < t$ and
$s' < t'$.

For $M \in G$, let $\mathcal{P}(M)$ be the set of points $M' \in G$ such that
$M' < M$ and, if $M \in G_{ij}$ for some $i, j$, then $M' \in V_{i-1,j} \cup
H_{i,j-1}$ (the left and bottom edges of the cell whose top and right edges
form $G_{ij}$). Finally, for $M = (s, s') \in G$ and $P = (t, t') \in
\mathcal{P}(M)$, let

$$V(P, M) = (s' - t')\, F^*\!\left(\frac{s - t}{s' - t'},\ \theta'_{j_M},\ \theta_{i_M}\right) = (s - t)\, F\!\left(\frac{s' - t'}{s - t},\ \theta_{i_M},\ \theta'_{j_M}\right) .$$


We can reformulate the problem of minimizing $U$ into the problem of finding
an integer $p$ and a sequence $M_0 = (0, 0), M_1, \ldots, M_{p-1}, M_p =
(1, 1) \in G$ such that, for all $i$, $M_{i-1} \in \mathcal{P}(M_i)$, which
minimizes

$$L(M_0, \ldots, M_p) := \sum_{i=1}^p V(M_{i-1}, M_i) .$$

Fig. 1. Piecewise linear function $\phi$ on the grid $G$



For $M \in G$, denote by $\mathcal{S}(M)$ the set of all sequences $M_0 =
(0, 0), M_1, \ldots, M_{p-1}, M_p = M$ such that for all $i$, $M_{i-1} \in
\mathcal{P}(M_i)$. Denote by $W(M)$ the infimum of $L$ over $\mathcal{S}(M)$.
The dynamic programming principle writes, in our case:

$$W(M) = \inf_{P \in \mathcal{P}(M)} \left(W(P) + V(P, M)\right) \qquad (11)$$

This formula enables one to compute by induction the function $W$ for all
$M \in G$. We have $W(0, 0) = 0$, and for all $k > 0$, if $W(M)$ has been
computed for all $M$ such that $i_M + j_M \le k$, then using (11), one can
compute $W(M)$ for all $M$ such that $i_M + j_M = k + 1$.

Dynamic programming has been widely used for speech recognition (dl), or
for contour matching (0, 0, 0, 0). As presented above, our method involves
no pruning and no constraint on the slope of the matching functional, unlike
most of the applications in dynamic time warping. If $\delta = 1/N$ is the grid
step for the discrete representation of $[0, 1]^2$ (the points $M_0, \ldots,
M_p$ will be assumed to have coordinates of the kind $(k/N, l/N)$ for integers
$k, l$ between 0 and $N$), one can
check that the complexity of the algorithm is of order N'^, for a complete global 
minimization. 
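
The recursion (11) can be sketched in a few lines of code. What follows is an illustrative variant rather than the authors' implementation, under several assumptions: the breakpoints of $\theta$ and $\theta'$ lie on the lattice $\{0, 1/N, \ldots, 1\}$ and are given as integer indices, the predecessors of a grid point are the lattice points on the closed left and bottom edges of its cell, and a minimum is taken in the recursion since $W$ is defined as an infimum. The names `dp_matching`, `breaks`, `values` and the example signals are hypothetical.

```python
import math

def dp_matching(F, breaks, values, breaks_p, values_p, N):
    """Value W(1, 1) of the dynamic programme on the lattice {0..N}^2.

    breaks / breaks_p: increasing integer indices 0 = s_0 < ... < s_m = N.
    values[i-1] is theta on piece i, values_p[j-1] is theta' on piece j.
    """
    on_s, on_sp = set(breaks), set(breaks_p)
    INF = float("inf")
    W = {(0, 0): 0.0}
    i = 1                                  # piece of theta with s_{i-1} < k <= s_i
    for k in range(1, N + 1):
        while breaks[i] < k:
            i += 1
        x0 = breaks[i - 1]                 # left edge of the current cell
        j = 1                              # piece of theta' with s'_{j-1} < l <= s'_j
        for l in range(1, N + 1):
            while breaks_p[j] < l:
                j += 1
            if k not in on_s and l not in on_sp:
                continue                   # (k, l) does not lie on the grid G
            y0 = breaks_p[j - 1]           # bottom edge of the current cell
            u, v = values[i - 1], values_p[j - 1]
            # Candidate predecessors: left edge, then bottom edge, of the cell.
            preds = [(x0, lp) for lp in range(y0, l)]
            preds += [(kp, y0) for kp in range(x0 + 1, k)]
            best = INF
            for kp, lp in preds:
                w = W.get((kp, lp), INF)
                if w < INF:
                    dx, dy = (k - kp) / N, (l - lp) / N
                    best = min(best, w + dx * F(dy / dx, u, v))   # W(P) + V(P, M)
            if best < INF:
                W[(k, l)] = best
    return W.get((N, N), INF)

# Example with the integrand of functional (3), F = 1 - sqrt(xi) |cos((u - v)/2)|.
F = lambda xi, u, v: 1.0 - math.sqrt(xi) * abs(math.cos((u - v) / 2.0))
N = 16
w_same = dp_matching(F, [0, 8, 16], [0.0, 1.0], [0, 8, 16], [0.0, 1.0], N)
w_shift = dp_matching(F, [0, 8, 16], [0.0, 1.0], [0, 4, 16], [0.0, 1.0], N)
print(w_same, w_shift)
```

In this sketch the segment joining a predecessor to a grid point stays inside one cell, so each step uses a single pair of constant values, as in the definition of $V(P, M)$. Matching a signal with itself should return a cost of zero (up to rounding), which gives a quick consistency check.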

5 Experiments 

We present experimental results of curve comparison. The matched functions
are the orientations of the angles of the tangents plotted versus the Euclidean
arc-length of the curves (which are assumed to have length 1). The curves are
closely approximated by polygons, and the algorithm of section 4 is used. Thus,
we are dealing with piecewise constant functions $\theta : [0, 1] \to
[0, 2\pi[$.

Since our representation is not rotation invariant, and since, for closed
curves, the origins of the arc-length coordinates are not uniquely defined, we
have first used a rigid alignment procedure similar to the one presented in Q.
Matching is then performed on the aligned shapes.

We have used several matching functionals. They are described in the cap- 
tions of the figures. 

The shapes have been extracted from a database composed of 14,000 geographical
contours from a map of a region of Mali. These contours are hand-drawn natural
boundaries of topographical sites, and the comparison procedure is part of an
application aiming at providing an automatic classification of the regions.

For each pair of compared shapes, the results are presented in two parts:
we first draw the (piecewise linear) optimal matching on $[0, 1]^2$, the
grey-levels corresponding to the values of $F(1, u, v)$ (or
$G_\lambda(1, u, v)$). On the second picture, we draw both shapes in the same
frame, with lines joining some matched points. One of the shapes has been
shrunk for clarity.



Fig. 2. Two comparisons within a set of six shapes from the database (each line
ordered by similarity). The distance is the minimum of $L_{\theta,\theta'}$
using $F(\xi, u, v) = 1 - \sqrt{\xi}\,|\cos((u - v)/2)|$.



6 Appendix 

We finish the proof of proposition 1 and show that (4) and the fact that $F$
has a partial derivative in $\xi$ at $\xi = 1$ imply [Self-matching]. For this,
we consider a particular case. Take numbers $0 < \gamma < \beta < 1$ and assume
that $\theta$ is constant on $[0, \gamma[$ (equal to $u \in \mathbb{R}^d$) and
on $[\gamma, \beta]$ (equal to $v$). Let $\phi(x) = x$ for $x \in [\beta, 1]$,
and $\phi$ be piecewise linear on $[0, \beta]$. More precisely, we fix
$\gamma^* < \gamma$ and let $\phi$ be linear on $[0, \gamma^*]$,
$[\gamma^*, \gamma]$ and $[\gamma, \beta]$. We also let $\tilde\gamma =
\phi(\gamma)$ and impose that $\gamma = \phi(\gamma^*)$ (see fig. 5).

Fig. 3. Shape comparisons using $F(\xi, u, v) = 1 - \sqrt{\xi}\,|\cos((u - v)/2)|$.

Fig. 4. Shape comparisons using $G_\lambda(\xi, u, v) = \xi^2 + 1/\xi + \lambda(1 + \xi)\sin^2((u - v)/2)$ with $\lambda = 100$




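Fig. 5. Analysis of (4) in a particular case

We thus have, on $[0, \beta]$:

$$\dot\phi(x) = \begin{cases}
\dfrac{\gamma}{\gamma^*} & \text{on } ]0, \gamma^*[ \\[1ex]
\dfrac{\tilde\gamma - \gamma}{\gamma - \gamma^*} & \text{on } ]\gamma^*, \gamma[ \\[1ex]
\dfrac{\beta - \tilde\gamma}{\beta - \gamma} & \text{on } ]\gamma, \beta[
\end{cases}$$

If we apply (4), we get the inequality

$$\gamma^* F\!\left(\frac{\gamma}{\gamma^*}, u, u\right) + (\gamma - \gamma^*)\, F\!\left(\frac{\tilde\gamma - \gamma}{\gamma - \gamma^*}, u, v\right) + (\beta - \gamma)\, F\!\left(\frac{\beta - \tilde\gamma}{\beta - \gamma}, v, v\right) \ge \gamma F(1, u, u) + (\beta - \gamma) F(1, v, v)$$

which yields

$$(\gamma - \gamma^*)\, F\!\left(\frac{\tilde\gamma - \gamma}{\gamma - \gamma^*}, u, v\right) \ge \gamma \left[F^*(1, u, u) - F^*\!\left(\frac{\gamma^*}{\gamma}, u, u\right)\right] + (\beta - \gamma) \left[F(1, v, v) - F\!\left(\frac{\beta - \tilde\gamma}{\beta - \gamma}, v, v\right)\right]$$

For $\xi \ne 1$, let $G(\xi, u, v) = (F(\xi, u, v) - F(1, u, v))/(\xi - 1)$ and
$G^*(\xi, u, v) = (F^*(\xi, u, v) - F^*(1, u, v))/(\xi - 1)$. We have

$$\begin{aligned}
(\gamma - \gamma^*)\, F\!\left(\frac{\tilde\gamma - \gamma}{\gamma - \gamma^*}, u, v\right)
 &\ge \gamma \left(1 - \frac{\gamma^*}{\gamma}\right) G^*\!\left(\frac{\gamma^*}{\gamma}, u, u\right) + (\beta - \gamma)\left(1 - \frac{\beta - \tilde\gamma}{\beta - \gamma}\right) G\!\left(\frac{\beta - \tilde\gamma}{\beta - \gamma}, v, v\right) \\
 &= (\gamma - \gamma^*)\, G^*\!\left(\frac{\gamma^*}{\gamma}, u, u\right) + (\tilde\gamma - \gamma)\, G\!\left(\frac{\beta - \tilde\gamma}{\beta - \gamma}, v, v\right)
\end{aligned}$$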
or, letting $\xi = (\tilde\gamma - \gamma)/(\gamma - \gamma^*)$, $\xi_1 =
\gamma^*/\gamma$, $\xi_2 = (\beta - \tilde\gamma)/(\beta - \gamma)$,

$$F(\xi, u, v) \ge G^*(\xi_1, u, u) + \xi\, G(\xi_2, v, v)$$

and one can check that, by suitably choosing the values of $\gamma$,
$\tilde\gamma$, $\gamma^*$, this inequality is true for any $\xi > 0$, and
$\xi_1 < 1$, $\xi_2 < 1$. Assume now that $F$ is differentiable with respect to
its $\xi$ variable at $\xi = 1$. For $u \in \mathbb{R}^d$, denote by
$\lambda(u)$

$$\lambda(u) = \frac{\partial F}{\partial \xi}(1, u, u)$$

This implies that $G$ and $G^*$ can be extended by continuity to $\xi = 1$,
and, since $G^*(\xi, u, u) = F(1/\xi, u, u) - \frac{1}{\xi} G(1/\xi, u, u)$, one
gets, in letting $\xi_1$ and $\xi_2$ tend to 1:

$$F(\xi, u, v) \ge F(1, u, u) + \xi\lambda(v) - \lambda(u)$$

Abstract. This paper is concerned with the simulation of the Partial
Differential Equation (PDE) driven evolution of a closed surface by means of
an implicit representation. In most applications, the natural choice for the
implicit representation is the signed distance function to the closed surface.
Osher and Sethian propose to evolve the distance function with a
Hamilton-Jacobi equation. Unfortunately the solution to this equation is not a
distance function. As a consequence, the practical application of the level
set method is plagued with such questions as when do we have to “reinitialize”
the distance function? How do we “reinitialize” the distance function? Etc.,
which reveal a disagreement between the theory and its implementation. This
paper proposes an alternative to the use of Hamilton-Jacobi equations which
eliminates this contradiction: in our method the implicit representation always
remains a distance function by construction, and the implementation does not
differ from the theory anymore. This is achieved through the introduction of a
new equation. Besides its theoretical advantages, the proposed method also has
several practical advantages which we demonstrate in three applications: (i)
the segmentation of the human cortex surfaces from MRI images using two coupled
surfaces mi, (ii) the construction of a hierarchy of Euclidean skeletons of a
3D surface, (iii) the reconstruction of the surface of 3D objects through
stereo US].



1 Introduction and Previous Work 



We consider a family of hypersurfaces $S(p, t)$ in $\mathbb{R}^3$, where $p$
parameterizes the surface and $t$ is the time, that evolve according to the
following PDE:

$$\frac{\partial S}{\partial t} = \beta \mathcal{N} \qquad (1)$$

with initial conditions $S(t = 0) = S_0$, where $\mathcal{N}$ is the inward
unit normal vector of $S$, $\beta$ is a velocity function and $S_0$ is some
initial closed surface.

Methods of curves evolution for segmentation, tracking and registration were 
introduced in computer vision by Kass, Witkin and Terzopoulos [El . These evo- 
lutions were reformulated by Malladi, Sethian et al. m, by Caselles, Kimmel 
and Sapiro j?) and by Kichenassamy et al. ca in the context of PDE-driven cur- 
ves and surfaces. There is an extensive literature that addresses the theoretical 
aspects of these PDE’s and offers geometrical interpretations as well as results 
of uniqueness and existence mwi- 



D. Vernon (Ed.): ECCV 2000, LNCS 1842, pp. 588-^23 2000. 

Level set methods were first introduced by Osher and Sethian in m in the
context of fluid mechanics and provide both a nice theoretical framework and
efficient practical tools for solving such PDE's. In those methods, the
evolution (1) is achieved by means of an implicit representation of the
surface $S$.

The key idea in Osher and Sethian's approach is to introduce the function
$u : \mathbb{R}^3 \times \mathbb{R} \to \mathbb{R}$ such that

$$u(S, t) = 0 \quad \forall t \qquad (2)$$

By differentiation (and along with $\mathcal{N} = -\nabla u / |\nabla u|$ and
(1)), we obtain the Hamilton-Jacobi¹ equation:

$$\frac{\partial u}{\partial t} = \beta |\nabla u| \qquad (3)$$

with initial conditions $u(\cdot, 0) = u_0(\cdot)$, where $u_0$ is some initial
function $\mathbb{R}^3 \to \mathbb{R}$ such that $u_0(S_0) = 0$. It has been
proved that for a large class of functions $\beta$ and $u_0$, the zero level
set at time $t$ of the solution of (3) is the solution at time $t$ of (1).
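
For completeness, the differentiation step leading from (1) and (2) to (3) can be written out. This is a standard chain-rule computation, given here as a sketch in the notation of the text:

```latex
% Differentiate u(S(p,t), t) = 0 with respect to t (chain rule):
\nabla u \cdot \frac{\partial S}{\partial t} + \frac{\partial u}{\partial t} = 0 .
% Substitute \partial S / \partial t = \beta \mathcal{N}
% and \mathcal{N} = -\nabla u / |\nabla u| :
\frac{\partial u}{\partial t}
  = -\beta \, \nabla u \cdot \mathcal{N}
  = \beta \, \frac{\nabla u \cdot \nabla u}{|\nabla u|}
  = \beta \, |\nabla u| .
```

The same computation applies to any level set $u = \varepsilon$, since the constant disappears under differentiation.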

Regarding the function $u_0$, it is most often chosen to be the signed distance
function to the closed surface $S_0$. This particular implicit function can be
characterized by the two equations:

$$\{x \in \mathbb{R}^3,\ u_0(x) = 0\} = S_0 \quad \text{and} \quad |\nabla u_0| = 1$$

Indeed, the magnitude of the gradient of $u_0$ is equal to the magnitude of the
derivative of the distance function from $S_0$ in the direction normal to
$S_0$, i.e., it is equal to 1.
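
The two characterizing equations are easy to verify on a simple shape. The sketch below is a 2D illustration only (the text works in $\mathbb{R}^3$): it assumes a circle of radius `r` as the initial front, uses the sign convention negative inside and positive outside (so that $-\nabla u / |\nabla u|$ points inward), and estimates the gradient by central finite differences. The function names are hypothetical.

```python
import math

def signed_distance_circle(x, y, r=1.0):
    """Signed distance to the circle of radius r: negative inside, positive outside."""
    return math.hypot(x, y) - r

def grad_norm(u, x, y, h=1e-5):
    """Central finite-difference estimate of |grad u| at (x, y)."""
    ux = (u(x + h, y) - u(x - h, y)) / (2 * h)
    uy = (u(x, y + h) - u(x, y - h)) / (2 * h)
    return math.hypot(ux, uy)

# Points away from the centre, where the distance function is smooth.
for x, y in [(0.3, 0.4), (2.0, -1.0), (-0.5, 1.5)]:
    print(grad_norm(signed_distance_circle, x, y))   # each value should be close to 1

# A point on the circle belongs to the zero level set.
print(signed_distance_circle(math.cos(0.7), math.sin(0.7)))  # should be close to 0
```

The gradient norm is undefined at the centre of the circle, which is why the sample points avoid it.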

It is known from jSj that the solution $u$ of (3) is not the signed distance
function to the solution $S$ of (1). This causes several problems which are
analyzed in the following section.

It is also important to notice that $\beta$ in (3) is defined in $\mathbb{R}^3$
whereas in (1) it is defined on the surface $S$. The extension of $\beta$ from
$S$ to the whole domain $\mathbb{R}^3$ is a crucial point for the analysis and
implementation of (3). There are mainly two ways of doing this.

(i) Most of the time this extension is natural. For example, if $\beta = H_S$,
the mean curvature of $S$ in (1), one can choose $\beta = H_u$, the mean
curvature of the level set of $u$ passing through $x$ in (3).

(ii) In some cases PSETE], this extension is not possible. Then one may
assign to $\beta(x)$ in (3) the value of $\beta(y)$ in (1) where $y$ is the
closest point to $x$ belonging to $S$. The problem with this extension is that
it hides an important dependence of $\beta$ in (3) with respect to $u$ and we
show in section P that in this case (3) is not a Hamilton-Jacobi equation.

The thrust of this paper is a reformulation of the level set methods introduced
by Osher and Sethian in m to eliminate some of the problems that are attached
to it, e.g. the need to reinitialize periodically the distance function or the
need to “invent” a velocity field away from the evolving front or zero level
set.

Our work is closely related to ideas in m and M appendix] and proposes
a new analysis of the problem of evolving Euclidean distance functions. The
implications of our work are both theoretical and practical.



¹ The difference between a Hamilton-Jacobi equation and a general first order
PDE is that the unknown function (here $u$) does not appear explicitly in the
equation.

2 Why the Classical Hamilton-Jacobi Equation of “Level
Sets” Does Not Preserve Distance Function

In this section, we suppose that $\beta$ is extended as explained in (i). The
fact that the solutions to Hamilton-Jacobi equations of the form (3) are not
distance functions has been proved formally in (3-. A convincing geometrical
interpretation of this fact is now given through two short examples.



2.1 First Example 

Let us consider the problem of segmenting a known object (an ellipse) in an
image by minimizing the energy of a curve jS]. Let us force the initial curve
to be exactly the solution (the known ellipse) and initialize $u_0$ to the
signed distance function to this ellipse, then evolve $u$ with the
Hamilton-Jacobi equation (3).

It is obvious that the zero level set of $u$ (let us call $S_0$ this ellipse)
will not evolve, since it is the solution to (1) and $\beta(x \in S_0) = 0$.

Notice however that replacing 0 by $\varepsilon \in \mathbb{R}$ in (2) implies
by differentiation the same equation (3), which means that the $\varepsilon$
level set of $u$ (let us call $S_\varepsilon$ this curve) also evolves
according to $\partial S_\varepsilon / \partial t = \beta \mathcal{N}$. In
consequence, $\beta(x \in S_\varepsilon) \ne 0$ and $S_\varepsilon$ evolves
toward $S_0$ in order to minimize its energy (cf. fig. 1). This shows that the
shock wave equation (3) requires that all the level sets of $u$ should converge
to the ellipse $S_0$ and therefore that $|\nabla u|$ increases dangerously.

Fig. 1. All the level sets of $u$ (shown as single curves) move towards the
ellipse $S_0$ in order to minimize their own energy with the effect that the
distance function is not preserved.



2.2 Second Example 

A point $M$ with coordinate $x \in \mathbb{R}$ and energy $E(x) = \frac{x^2}{2}$ is moving along the real line in order to minimize its energy. We force the point $M$ to be at $x_0 \neq 0$ at $t = 0$. The level set version of this problem is to define $u_0$ on the real line as $u_0(x) = x - x_0$ and to evolve $u$ with the Hamilton-Jacobi equation $\frac{\partial u}{\partial t} = x \frac{\partial u}{\partial x}$. The solution is $u(x,t) = e^t x - x_0$. Figure 2 shows $u$ at 3 time instants ($0 = t_0 < t_1 < t_2$). The zero level set of $u$ is indeed traveling to the origin $O$ but the slope of $u$ is $\frac{\partial u}{\partial x} = e^t$ and increases exponentially in time.

Fig. 2. The point $M$ moves on the horizontal line in order to minimize its energy $E(x) = \frac{x^2}{2}$. The function $u$, initially of slope 1, becomes more and more vertical.
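
The closed-form solution of the second example can be checked numerically. The following is a minimal sketch, assuming the solution $u(x,t) = e^t x - x_0$ and the equation $\partial u/\partial t = x\,\partial u/\partial x$ as stated in the text; the function names, the default value of `x0` and the finite-difference step `h` are illustrative choices, not part of the paper.

```python
import math

def u(x, t, x0=1.0):
    """Closed-form solution u(x, t) = exp(t) * x - x0 of the second example."""
    return math.exp(t) * x - x0

def pde_residual(x, t, x0=1.0, h=1e-5):
    """Central-difference estimate of du/dt - x * du/dx (close to 0 for a solution)."""
    du_dt = (u(x, t + h, x0) - u(x, t - h, x0)) / (2 * h)
    du_dx = (u(x + h, t, x0) - u(x - h, t, x0)) / (2 * h)
    return du_dt - x * du_dx

def zero_level(t, x0=1.0):
    """Position of the point M at time t, i.e. the zero of u(., t)."""
    return x0 * math.exp(-t)

def slope(t):
    """Spatial slope du/dx of the closed-form solution (independent of x)."""
    return math.exp(t)

# Illustrative run: the zero level set moves to the origin while the slope grows.
for t in (0.0, 1.0, 2.0):
    print(t, zero_level(t), slope(t), pde_residual(0.5, t))
```

The printed slope grows like $e^t$ while the zero crossing decays like $e^{-t}$, which is the behaviour described above: the front is tracked correctly but $u$ stops being a distance function.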



The second example is a rephrasing of what happens in the normal direction to the evolving curve in the first example. It is now obvious why driving all the level sets of $u$ with (3) cannot conserve distance functions and in addition leads to unbounded values of $|\nabla u|$. In practical applications, one is compelled to “reinitialize” the implicit function $u$ to be a distance function, which is obviously a contradiction and which shows a gap between the theory and its real application.

In the next section, we convince the reader that maintaining $u$ as a distance function (i.e. such that $|\nabla u| = 1$) during all the time of the evolution is definitely desirable, sometimes crucial.



3 Why We Should Preserve the Distance Function 

There are at least two reasons for preserving the signed distance function to the 
evolving surface, a theoretical one and a practical one. 

(i) From the theoretical viewpoint, the implicit description of $S$ (seen as a subset of $\mathbb{R}^3$) and its signed distance function $u$ are equivalent descriptions. Indeed, given any surface $S$, its signed distance function is uniquely defined. Conversely, any implicit function $u$ satisfying $|\nabla u| = 1$ is the signed distance function to a surface plus a constant (this last constant is taken equal to 0 on the surface). Since these descriptions are equivalent, one can immediately transpose properties of the first one into properties of the second one and vice versa. For example, $u$ has converged if and only if $S$ has converged (which is not true with the Hamilton-Jacobi equation (3) according to the last section).

Moreover, one can deduce interesting intrinsic properties of $S$ by a local knowledge of $u$. In [?], it is proved that the second fundamental form of $S$ can be computed using the derivatives of the squared distance function. In addition, some applications in medical image analysis such as the segmentation of the cortex using two coupled surfaces [?] assume that the distance between the surfaces is known at any time. As a last example, the computation of the skeleton of a surface requires the detection of the singularities of its distance function [?].

(ii) From the practical viewpoint, the numerical approximation of the derivatives of $u$ by finite differences requires the choice of a spatial step $dx$. One chooses a small $dx$ if the slope (the gradient) of the function is large and a larger $dx$ if the function has small variations. Since level sets are most often implemented on regular grids, it is more efficient to use the same step $dx = 1$ for each grid point. It is obvious that this approximation is more accurate if the norm of the gradient of $u$ is known, which is the case with distance functions since $|\nabla u| = 1$. Keeping $|\nabla u|$ bounded ensures that the derivatives of $u$ are always computable without the need to “reinitialize” $u$.

We now describe a new approach that preserves the signed distance function 
and therefore meets these two requirements. 

4 How to Preserve the Distance Function 



In this section, we suppose that $u_0 = u(\cdot, 0)$ is initialized at $t = 0$ as the signed distance function to the initial surface $S_0$.

The basic idea is to change equation (3) in such a way that at each time instant $u$ is the signed distance function to the solution $S$ of (1). In order to achieve this goal, we look for a function $B : \mathbb{R}^3 \times \mathbb{R}^+ \to \mathbb{R}$ such that $\frac{\partial u}{\partial t} = B$ and which satisfies the two constraints: (i) $x \to u(x, \cdot)$ is a distance function, (ii) the zero level set of $u$ evolves according to (1).

We express these constraints with the system of equations:

$$\begin{cases} B|_{u=0} = \beta & (a) \\ \dfrac{\partial u}{\partial t} = B & (b) \\ |\nabla u| = 1 & (c) \end{cases} \tag{4}$$



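where $B|_{u=0}$ denotes the restriction of $B$ to the zero level set of $u$. By differentiating (4b) and (4c), we obtain:

$$\nabla \frac{\partial u}{\partial t} = \nabla B \quad \text{and} \quad \frac{\nabla u}{|\nabla u|} \cdot \frac{\partial \nabla u}{\partial t} = 0 \tag{5}$$

using the Schwarz equality $\frac{\partial \nabla u}{\partial t} = \nabla \frac{\partial u}{\partial t}$, we get:

$$\nabla u \cdot \nabla B = 0 \tag{6}$$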

which, together with (4a) and (4b), determines the function $B$. Relation (6) states that the function $B$ does not vary along the characteristics of $u$ (the characteristics of $u$ are the integral curves of $\nabla u$). It also means that the characteristics of $u$ and $B$ are orthogonal.

In order to go one step further in the resolution of the system, we must recall an important property: the characteristics of distance functions are straight lines (cf. fig. 3). This implies that $B$ is constant along straight lines. These lines (or rays) intersect the zero level set of $u$ at a point where $B$ is known according to (4a).

Given any point $x \in \mathbb{R}^3$, an equation of the characteristic of $u$ passing through $x$ is $\lambda \to x - \lambda \nabla u$. Since the distance of $x$ to the zero level set is $u(x)$ and $|\nabla u(x)| = 1$, the point $y = x - u \nabla u$ is on the zero level set of $u$. Notice that $y$ is the closest point to $x$ such that $u(y) = 0$. According to the last reasoning, we have $B(x) = B(y) = \beta(x - u \nabla u)$. Therefore, the solution to the initial system is:

$$\frac{\partial u}{\partial t} = \beta(x - u \nabla u) \tag{7}$$

with initial condition $u(\cdot, 0) = u_0(\cdot)$. This equation is the main result of the paper. It was also found in [?, appendix] with a different reasoning. Notice that equation (7) is not a Hamilton-Jacobi equation since $u$ appears in the right-hand side and plays a major role. An interpretation of (7) is the following: the zero level set of $u$ is driven by $\frac{\partial S}{\partial t} = \beta \mathcal{N}$ as proposed by Osher and Sethian. The evolution of this particular surface geometrically defines (by propagation) the evolution of all other level sets.

Fig. 3. Characteristic curves of the field $\nabla u$.
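
The difference between the two evolutions can be seen on the one-dimensional setting of the second example of section 2. The sketch below is only an illustration under stated assumptions: it uses explicit Euler time stepping and simple finite differences (neither is prescribed by the text), the velocity `beta(x) = x` follows the sign convention of that example, and all function names, grid sizes and step sizes are hypothetical.

```python
def gradient(u, dx):
    """Central differences inside the grid, one-sided differences at both ends."""
    n = len(u)
    g = [0.0] * n
    for i in range(1, n - 1):
        g[i] = (u[i + 1] - u[i - 1]) / (2 * dx)
    g[0] = (u[1] - u[0]) / dx
    g[-1] = (u[-1] - u[-2]) / dx
    return g

def beta(x):
    """Velocity of the 1D example, in the convention du/dt = beta * du/dx."""
    return x

def step_hamilton_jacobi(u, xs, dx, dt):
    """One explicit Euler step of du/dt = beta(x) * du/dx."""
    g = gradient(u, dx)
    return [ui + dt * beta(x) * gi for ui, x, gi in zip(u, xs, g)]

def step_distance_preserving(u, xs, dx, dt):
    """One explicit Euler step of du/dt = beta(x - u * du/dx), the 1D form of (7)."""
    g = gradient(u, dx)
    return [ui + dt * beta(x - ui * gi) for ui, x, gi in zip(u, xs, g)]

def run(stepper, steps=100, dt=0.01, x0=1.0, n=41, dx=0.1):
    """Evolve u0(x) = x - x0 on a regular grid starting at x = -2."""
    xs = [-2.0 + i * dx for i in range(n)]
    u = [x - x0 for x in xs]
    for _ in range(steps):
        u = stepper(u, xs, dx, dt)
    return xs, u

def max_slope(u, dx):
    """Largest absolute slope of u on the grid."""
    return max(abs(g) for g in gradient(u, dx))
```

With these settings the slope of the first evolution is multiplied by $(1 + dt)$ at every step, whereas the second evolution keeps a unit slope and only translates $u$, so its zero crossing still moves toward the origin.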

Remark 1. Equation (7) looks simple but is not. Consider for example the case of mean curvature flow: (7) writes $\frac{\partial u}{\partial t}(x,t) = \operatorname{div}(\nabla u(x - u(x,t) \nabla u(x,t), t))$, which is a priori not a PDE but a functional equation (indeed, two different points in $\mathbb{R}^3 \times \mathbb{R}^+$ are considered, namely $(x,t)$ and $(x - u \nabla u, t)$). However, notice that $u(x - u \nabla u) = 0$, $\nabla u(x - u \nabla u) = \nabla u(x)$, and according to [?], the second fundamental form at $x - u \nabla u$ can be computed using the derivatives of $u(x,t)$ up to the third order. This shows that for a large class of velocity functions (in particular for mean-curvature flow), (7) is indeed a PDE.



Remark 2. One guesses that the integral version of equation (7) is the equation $u(S + \lambda \mathcal{N}) = \lambda \ \forall t, \lambda$. This can be proved by differentiation with respect to $t$ and $\lambda$. It states that the surface parallel to $S$ at distance $\lambda$ from $S$ should be the $\lambda$ level set of $u$. This is to be compared to the constraint $u(S, t) = 0 \ \forall t$ introduced by Osher and Sethian.



The uniqueness of the closest point $y$ to $x$ such that $u(y) = 0$ is only guaranteed if $\nabla u(x)$ exists. The set of points of $\mathbb{R}^3$ where $\nabla u$ is not defined is called the skeleton of $S$ (cf. fig. 4). Skeletons are very important in computer vision [?]. Since it turns out that they are a byproduct of our new proposed evolution, we describe in the next section an implementation of equation (7) in which special care is taken of the computation of the skeleton.

5 Implementation 

In this section, we propose a straightforward implementation of the previous theory. $u$ is initialized as the signed distance function to the initial surface. We fix $u$ at a particular instant $t$ and compute the real field $B(x, t) = \beta(x - u \nabla u)$ on a narrow band [?] of $S$. Once $B$ is known, $u$ can be updated by $u(x, t + dt) = u(x, t) + B(x, t)\,dt$. The computation of $B$ is done in two steps corresponding respectively to equations (4a) and (6). The difficulty is that we work on a discrete grid and this can have dramatic consequences if proper care is not taken of the sampling effects.

Fig. 4. The skeleton of the zero level set is determined by the points where $\nabla u$ is not defined.

In order to deal with those effects, we introduce some notations. Points of $\mathbb{R}^3$ such that none of their coordinates is an integer will be denoted by lower case letters, e.g. $x$, and called real points. Points of $\mathbb{N}^3$, where $\mathbb{N}$ is the set of integers, will be denoted by upper case letters, e.g. $X$, and called voxels. We can think of $x$ as a point falling in a cube formed by eight voxels. We note $V(x)$ this set of eight voxels.

If $f$ is a function defined on $\mathbb{R}^3$, and $x$ is a real point such that the values of $f$ are known at all voxels of $V(x)$, we note $f_I(x)$ the value of the trilinear interpolation at $x$. In detail, if $x = (x_1, x_2, x_3) = (n_1 + \varepsilon_1, n_2 + \varepsilon_2, n_3 + \varepsilon_3)$, where $n_i \in \mathbb{N}$ and $0 < \varepsilon_i < 1$, then we have by a simple linear interpolation $f_I(x_1, x_2, x_3) = (1 - \varepsilon_1) f(n_1, x_2, x_3) + \varepsilon_1 f(n_1 + 1, x_2, x_3)$. By applying recursively this rule to $f(n_1, x_2, x_3)$ and $f(n_1 + 1, x_2, x_3)$, one expresses $f_I(x)$ as a linear combination of the samples of $f$ at the voxels of $V(x)$, the weights being third order polynomials of the coordinates $(\varepsilon_1, \varepsilon_2, \varepsilon_3)$.
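
The interpolation rule above can be sketched as follows. This is a minimal illustration, assuming that the samples of $f$ are available through a callable taking an integer triple; the name `trilinear` and this calling convention are choices made here, not the paper's.

```python
import math

def trilinear(f, x):
    """Trilinear interpolation f_I(x) from the eight voxels of V(x).

    f maps an integer triple (i, j, k) to a sample value; x is a real point.
    """
    n = [math.floor(c) for c in x]        # voxel corner (n1, n2, n3)
    e = [c - m for c, m in zip(x, n)]     # offsets (eps1, eps2, eps3)
    value = 0.0
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                # Each weight is a product of three linear factors,
                # hence a third order polynomial of the offsets.
                w = ((e[0] if di else 1 - e[0]) *
                     (e[1] if dj else 1 - e[1]) *
                     (e[2] if dk else 1 - e[2]))
                value += w * f((n[0] + di, n[1] + dj, n[2] + dk))
    return value
```

Expanding the recursion gives exactly these eight products, which is why the interpolation reproduces any function that is linear in each coordinate separately.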

Let $A(X)$ be the 26-neighborhood of the voxel $X$. Since generically the zero level set of $u$ is composed of real points, we need to determine when a voxel $X$ is adjacent to this zero level set. Consider the function $C_u$ defined on the voxels of the grid such that $C_u(X) = 0$ if $u(X) > 0$ and $C_u(X) = 1$ if $u(X) < 0$. A voxel $X$ is said to be adjacent to the zero level set of $u$ if $\exists Y \in A(X),\ C_u(Y) \neq C_u(X)$. We call $Z$ the set of voxels adjacent to the zero level set of $u$. We are now in position to describe the two steps of our computation.
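
The definition of $Z$ translates directly into a neighborhood test. The sketch below assumes that $u$ is stored as a dictionary from integer triples to samples, and it assigns voxels with $u(X) = 0$ to the class $C_u = 1$; both choices are assumptions made for the illustration, since the text leaves that case open.

```python
from itertools import product

def sign_class(value):
    """C_u: 0 where u > 0, 1 otherwise (u == 0 is put in class 1 by assumption)."""
    return 0 if value > 0 else 1

def adjacent_voxels(u):
    """Return the set Z of voxels having a 26-neighbour of the opposite class.

    u is a dict mapping integer triples (i, j, k) to samples of u.
    """
    offsets = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]
    z = set()
    for X, value in u.items():
        for d in offsets:
            Y = (X[0] + d[0], X[1] + d[1], X[2] + d[2])
            # Neighbours outside the sampled grid are ignored.
            if Y in u and sign_class(u[Y]) != sign_class(value):
                z.add(X)
                break
    return z
```

For a planar zero level set lying between two grid planes, $Z$ consists of exactly those two planes of voxels.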

5.1 First Step: Computation of $\beta$ on $Z$

The first step is the computation of $\beta$ on $Z$. These values are stored in a temporary buffer called $B^0$. There are two ways to do this. If $\beta$ is defined on $\mathbb{R}^3$, then one can assign $B^0(X) = \beta(X)\ \forall X \in Z$. If $\beta$ is only defined on the nodes of a mesh describing the zero level set of $u$, then one can assign $B^0(X) = \beta(y_i)\ \forall X \in Z$, where $y_i$ is the closest node of the mesh to the voxel $X$. In both cases, the final value of $B(X)$ is not the value of $B^0(X)$, as explained in the second step.

Notice that the definition of $Z$ ensures that if $u_I(x) = 0$ then $V(x) \subset Z$ and in consequence $B^0_I(x)$ can be computed.

5.2 Second Step: Computation of B on the Narrow Band

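The purpose is to propagate the values of $B$ from $Z$ to the whole narrow band. This is done by $B(X, t) = B^0_I(y, t)$ where $u_I(y) = 0$ and $y$ lies on the same characteristic of $u$ as $X$. Computing directly $y = X - u \nabla u$ is not robust since small errors in $\nabla u$ may introduce larger errors (proportional to $u$) in $y$. Instead, we follow the characteristic passing through $X$ by unit steps (cf. fig. 5):

$$\begin{cases} y_0 = X \\ y_{n+1} = y_n - \left\{ \begin{array}{l} \text{if } u_I(y_n) < 0 \text{ then } \max(u_I(y_n), \operatorname{sign}(u_I(y_n))) \\ \text{if } u_I(y_n) > 0 \text{ then } \min(u_I(y_n), \operatorname{sign}(u_I(y_n))) \end{array} \right\} \nabla_I u(y_n) \\ \text{until } u_I(y_n) = 0 \end{cases}$$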
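
The marching along a characteristic can be sketched as below. This is a simplified illustration under explicit assumptions: $u_I$ and $\nabla_I u$ are passed as callables (in the text they are trilinear interpolations of grid values), the stopping test uses a tolerance `tol` instead of exact equality, and `max_steps` is a safety bound; these names and parameters are hypothetical.

```python
def march_to_zero_level(x, u, grad_u, tol=1e-9, max_steps=1000):
    """Follow the characteristic of u through x until the zero level set.

    u(y) returns the (interpolated) distance value at the real point y,
    grad_u(y) returns the (interpolated) gradient at y as a sequence.
    """
    y = list(x)
    for _ in range(max_steps):
        value = u(y)
        if abs(value) <= tol:
            break
        # Signed step: the distance itself, clipped to one grid unit.
        step = min(value, 1.0) if value > 0 else max(value, -1.0)
        g = grad_u(y)
        y = [yc - step * gc for yc, gc in zip(y, g)]
    return y
```

For the distance function to a sphere, the march from any point other than the center ends at the radial projection of that point onto the sphere, whether the start lies inside or outside.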

This marching is done for each voxel in the narrow band, even those of $Z$.

Fig. 5. The computation of $y(A)$ by $y(A) = A - u(A) \nabla u(A)$ is potentially subject to large errors. For $B$, the characteristic line is followed by unit steps in order to avoid this error.

The computation of the march direction $\nabla_I u(y_n)$ requires the evaluation of $\nabla u$ at voxels of the grid. The choice of the numerical scheme for $\nabla u(X)$ is crucial since it may introduce unrecoverable errors if $X$ lies on the skeleton of $S$. Our choice is based on the schemes used in the resolution of Hamilton-Jacobi equations where shocks occur [?]. These schemes use switch functions which turn on or off whenever a shock is detected. We make our choice explicit here. Let $D_x^+ u = u(i+1, j, k) - u(i, j, k)$ and $D_x^- u = u(i, j, k) - u(i-1, j, k)$, with similar expressions for $D_y$ and $D_z$. We form the eight estimators $D^i u$, $i = 1, \ldots, 8$ of $\nabla u$, namely:

$$D^1 u = (D_x^+ u, D_y^+ u, D_z^+ u)$$

$$D^2 u = (D_x^+ u, D_y^+ u, D_z^- u)$$

$$\vdots$$

$$D^8 u = (D_x^- u, D_y^- u, D_z^- u)$$

In our current implementation we use $\nabla u(X) = \operatorname{ArgMax}_i(|D^i u(X)|)$. Indeed, apart from points on the skeleton of $S$ where $\nabla u$ is undefined, $|\nabla u(X)|$, which should be equal to 1 since $u$ is a distance function, is found to be in practice less than or equal to 1 depending on which of the operators $D^i$ we use. Hence the direction of maximum slope at $X$ is the direction of the closest point to $X$ of the zero level set of $u$. The fact that the skeleton can be detected by comparing the vectors $D^1 u, D^2 u, \ldots, D^8 u$ is discussed in section 6.2. It is possible to take into account other properties of $u$ in order to design an even more specific scheme. For example, the stability can be improved with a scheme which guarantees that the norm of $\nabla u$ will not vary.
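
The selection of the gradient estimator can be sketched as follows. The code assumes that $u$ is available as a callable on integer triples and that the eight estimators are all the forward/backward combinations of one-sided differences; the enumeration order and the function names are illustrative and are not taken from the paper.

```python
from itertools import product
import math

def one_sided_estimators(u, X):
    """The eight estimators D^i u of grad u at the voxel X = (i, j, k)."""
    i, j, k = X
    fwd = (u((i + 1, j, k)) - u(X), u((i, j + 1, k)) - u(X), u((i, j, k + 1)) - u(X))
    bwd = (u(X) - u((i - 1, j, k)), u(X) - u((i, j - 1, k)), u(X) - u((i, j, k - 1)))
    estimators = []
    for choice in product((0, 1), repeat=3):   # 0: forward, 1: backward, per axis
        estimators.append(tuple(bwd[a] if c else fwd[a] for a, c in enumerate(choice)))
    return estimators

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(c * c for c in v))

def gradient_max_slope(u, X):
    """Estimator of largest norm, used as grad u(X)."""
    return max(one_sided_estimators(u, X), key=norm)
```

Next to a ridge of $u$, a difference that straddles the ridge underestimates the slope, so the estimator of largest norm is the one built from samples lying on the same side as $X$.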

6 Applications 

We now describe three applications where our new method is shown to work 
significantly better than previous ones. 



6.1 Cortex Segmentation Using Coupled Surfaces 

We have implemented the segmentation of the cortical gray matter (a volumetric layer of variable thickness ($\approx 3$ mm)) from MRI volumetric data using two coupled surfaces, as proposed in [?] by Zeng et al. The idea put forward in [?] is to evolve simultaneously two surfaces with equations of the form (1). An inner surface $S_{in}$ captures the boundary between the white and the gray matter and an outer surface $S_{out}$ captures the exterior boundary of the gray matter. The segmented cortical gray matter is the volume between these two surfaces. The velocities of the two surfaces are:

$$\beta_{in} = f(I - I_{in}) + C(u_{out} + \varepsilon) \tag{8}$$

$$\beta_{out} = f(I - I_{out}) + C(u_{in} - \varepsilon) \tag{9}$$

where $I$ is the local gray intensity of the MRI image, $I_{in}$ and $I_{out}$ are two thresholds ($I_{in}$ for the white matter and $I_{out}$ for the gray matter), $\varepsilon$ is the desired thickness and $C$ and $f$ have the shape of figure 6.

Fig. 6. Shapes of the functions $f$ and $C$ in equations (8) and (9). $\delta_i$ and $\delta_d$ are two fixed tolerances.

Let us interpret equation (8). The first term $f(I - I_{in})$ forces the gray level values to be close to $I_{in}$ on $S_{in}$: it is the data attachment velocity term. The second term $C(u_{out} + \varepsilon)$ models the interaction between $S_{out}$ and $S_{in}$: it is the coupling term. According to the shape of $C$, see figure 6, if locally the two surfaces are at a distance $\varepsilon = 3$ mm, then the coupling term has no effect ($C = 0$) and $S_{in}$ evolves in order to satisfy its data attachment term. If the local distance between $S_{in}$ and $S_{out}$ is too small ($< \varepsilon$) then $C > 0$ and $S_{in}$ slows down in order to get further from $S_{out}$. If the local distance between $S_{in}$ and $S_{out}$ is too large ($> \varepsilon$) then $C < 0$ and $S_{in}$ speeds up in order to move closer to $S_{out}$. A similar interpretation can be done for (9).

If these evolutions are implemented with the Hamilton-Jacobi equation (3), then the following occurs: the magnitudes of the gradients of $u_{out}$ and $u_{in}$ increase with time ($|\nabla u_{out}| > 1$ and $|\nabla u_{in}| > 1$). As a consequence, the distance between $S_{in}$ and $S_{out}$, which is taken as $u_{in}(x)$ for $x$ on $S_{out}$ and $u_{out}(x)$ for $x$ on $S_{in}$, is overestimated. Since the coupling term is negative in (8) and positive in (9), both $S_{out}$ and $S_{in}$ evolve in order to become closer and closer to each other (until the inevitable reinitialization of the distance functions is performed). In other words, with the standard implementation of the level sets, the incorrect evaluation of the distance functions prevents the coupling term from acting correctly and, consequently, also prevents the data attachment terms from playing their roles.

On the other hand, if these evolutions are implemented with our new PDE, then a much better interaction between the two terms is achieved since the data attachment term can fully play its role as soon as the distance between the two surfaces is correct (cf. fig. 8).

These results are demonstrated in figure 8, on which we now comment. Each row corresponds to a different 32 x 32 sub-slice of an MRI image. The first column shows the original data, and some regions of interest (concavities) are labeled A, B and C. The second column shows a simple thresholding at $I_{in}$ and $I_{out}$. The third column shows the cross-sections of $S_{in}$ and $S_{out}$ through the slices if the coupling terms are not taken into account. This is why these curves have the same shape as in the second column. One observes that the segmented gray matter does not have the wanted regular thickness. In the fourth column, the coupling terms are taken into account and the evolutions (8) and (9) are implemented with the Hamilton-Jacobi equation (3). One observes (in particular at the concavities indicated in the first column) that the distance constraint is well satisfied but the data attachment term was neglected. This is due to the fact that with (3) the distance between the two surfaces is overestimated. In the fifth column, this same evolution is implemented with the new PDE introduced in this paper (7). One can observe a much better result at concavities. This is due to the fact that the coupling terms stop having any effect as soon as the distance between the surfaces is correct, allowing the data term to drive the surfaces correctly according to the gray level values.



6.2 Extraction of the Skeleton of an Evolving Surface 

Skeletons are widely used in computer vision to describe global properties of objects. This representation is useful in tasks such as object recognition and registration because of its compactness [?].

One of the advantages of our new level set technique is that it provides, almost for free, at each time instant a description of the skeleton of the evolving surface or zero level set.

We show an example of this on one of the results of the segmentation described in the previous section. We take the outside surface of the cortex and simplify it using mean-curvature flow, i.e. the evolution $\frac{\partial S}{\partial t} = H \mathcal{N}$ where $H$ is the mean curvature. This evolution is shown in the left column of figure 10. Since the distance function $u$ to the zero level set is preserved at every step, it is quite simple to extract from it the skeleton by using the fact that it is the set of points where $\nabla u$ is not defined [?]. This is shown in the right column of figure 10. Each surface is rescaled in order to occupy the whole image.

The skeletons are computed using the distance function to the evolving surface as follows. We look for the voxels where the eight estimators $D^i u$ of $\nabla u$ defined in section 5.2 differ a lot and threshold the simple criterion:

/ D^u Du \ 

where (., .) denotes the dot product of two vectors and Du = | 

This can be interpreted as a measure of the variations of the direction of $\nabla u$ (which are large in the neighborhood of the skeleton).

The results for the left column of figure 10 are shown in the right column of the same figure, where we clearly see how the simplification of the shape of the cortex (left column) goes together with the simplification of its skeleton (right column).

Note that because it preserves the distance function, our framework allows the use of more sophisticated criteria for determining the skeleton based on this distance function.



6.3 Stereo Reconstruction via Level Sets 

In this last application, we show how our approach allows a faster convergence when solving the problem of stereo reconstruction from $n \geq 2$ views by means of a PDE-driven surface introduced by Faugeras and Keriven in [?]. More generally, the method described in this article offers significant savings each time the cost of the computation of the velocity term $\beta$ is high. In the stereo application this velocity is given by:

$$\beta = \phi H \tag{10}$$

where $H$ is the mean curvature of $S$ and $\phi$ is a measure of the local similarity of two of the $n$ images of the three-dimensional scene to be reconstructed. Let us qualitatively compare the cost (in time) of implementing stereo, equation (10), to the cost of implementing mean curvature flow for which $\beta = H$.

$\phi$ is derived from the normalized cross-correlation of two small sub-images (say of size $n = 15 \times 15$). The number of multiplications (the most costly operation) is $3n$. Indeed, let $a$ and $b$ be two vectors of length $n$. The calculation of their normalized cross-correlation [?] mainly requires the calculation of the three dot products $\langle a, a \rangle$, $\langle b, b \rangle$ and $\langle a, b \rangle$ (i.e., for computing this criterion for two images of size $15 \times 15$, one reorders the pixel values in a vector of length $n = 15 \times 15$, which shows that 675 multiplications are needed in this case). Computing $H$ requires 25 multiplications. As a consequence, one iteration of (10) is approximately 30 times slower than one iteration of the mean curvature flow. This is the main reason why the convergence of the stereo algorithm [?] is slow (about 2h30 on a SunSO with $n = 3$ images and a $100^3$ grid) and why it is important to speed it up.
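
The multiplication count quoted above can be reproduced with a few lines. The sketch below is an assumption-laden illustration: it treats the two sub-images as vectors that are already mean-centered, computes the correlation from the three dot products only, and reports the number of multiplications spent in those dot products; the function name and the returned pair are choices made here.

```python
import math

def normalized_cross_correlation(a, b):
    """Correlation of two equal-length vectors from three dot products.

    Returns the correlation value and the number of multiplications
    used by the three dot products (3 * n for vectors of length n).
    """
    n = len(a)
    aa = sum(x * x for x in a)               # <a, a>: n multiplications
    bb = sum(y * y for y in b)               # <b, b>: n multiplications
    ab = sum(x * y for x, y in zip(a, b))    # <a, b>: n multiplications
    return ab / math.sqrt(aa * bb), 3 * n
```

For two $15 \times 15$ windows reordered as vectors of length 225, the count is 675, i.e. 27 times the 25 multiplications quoted for $H$, in line with the factor of about 30 given in the text.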

Notice however that in our approach $\beta$ is evaluated far fewer times (indeed, $\beta$ is evaluated only on $Z$ and not on the whole narrow band). Moreover, the second step (section 5.2) of our algorithm is independent of the specific application: this is why our method is so advantageous in applications where the computation of $\beta$ is very expensive.

The table of figure 7 shows the considerable gain obtained in the experiment described in figure 9.



Band half width    $\partial u / \partial t = \beta |\nabla u|$    $\partial u / \partial t = \beta(x - u \nabla u)$    gain
4                  8856 s                                          6730 s                                               24%
8                  19748 s                                         10998 s                                              44%



Fig. 7. Timings of the reconstruction on a SunSO. 



7 Conclusion 

We have proposed a new scheme for solving the problem of evolving through the technique of level sets a surface $S(t)$ satisfying a PDE such as (1). This scheme introduces a new PDE, (7), that must be satisfied by the auxiliary function $u(t)$ whose zero level set is the surface $S(t)$. The prominent feature of the new scheme is that the solution to this PDE is the distance function to $S(t)$ at each time instant $t$. Our approach has many theoretical and practical advantages that were discussed and demonstrated on three applications. Since the distance function to the evolving surface is in most applications the preferred function, we believe that the PDE that was presented here is an interesting alternative to Hamilton-Jacobi equations, which do not preserve this function.

Fig. 9. The three images on the top left were taken simultaneously from different points of view. The image on the top right shows the initial surface (a sphere) with the three images back-projected on it. The reconstruction was obtained by deforming this sphere with $\beta$ given by (10). The remaining 15 images show the resulting reconstruction from various points of view.

Fig. 10. Computation of the skeletons of a family of surfaces, see text.

Abstract. The medial surface of a volumetric object is of significant
interest for shape analysis. However, its numerical computation can be 
subtle. Methods based on Voronoi techniques preserve the object’s topo- 
logy, but heuristic pruning measures are introduced to remove unwanted 
faces. Approaches based on Euclidean distance functions can localize 
medial surface points accurately, but often at the cost of altering the 
object’s topology. In this paper we introduce a new algorithm for com- 
puting medial surfaces which addresses these concerns. The method is 
robust and accurate, has low computational complexity, and preserves 
topology. The key idea is to measure the net outward flux of a vector field 
per unit volume, and to detect locations where a conservation of energy 
principle is violated. This is done in conjunction with a thinning process 
applied in a cubic lattice. We illustrate the approach with examples of 
medial surfaces of synthetic objects and complex anatomical structures 
obtained from medical images. 



1 Introduction 

Medial surface based representations are of significant interest for a number of applications in biomedicine, including object representation [?], registration [?] and segmentation [?]. Such descriptions are also popular for animating objects in graphics and manipulating them in computer-aided design. They provide a compact representation while preserving the object’s genus and retain sufficient local information to reconstruct (a close approximation to) it. This facilitates a number of important tasks including the quantification of the local width of a complex structure, e.g., the grey matter in the human brain, and the analysis of its topology, e.g., the branching pattern of blood vessels in angiography images. Graph-based abstractions of such data have also been proposed [?]. Despite their popularity, the stable numerical computation of medial surfaces remains a challenging problem. Unfortunately, the classical difficulties associated with computing their 2D analog, the Blum skeleton, are only exacerbated when a third dimension is added.

1.1 Background

The 2D skeleton of a closed set $A \subset \mathbb{R}^2$ is the locus of centers of maximal open discs contained within the complement of the set [?]. An open disc is maximal if there exists no other open disc contained in the complement of $A$ that properly contains the disc. The medial surface of a closed set $A \subset \mathbb{R}^3$ is defined in an analogous fashion as the locus of centers of maximal open spheres contained in the complement of the set. It is often referred to as the 3D skeleton, though this term is misleading since it is in fact comprised of a collection of 3D points, curves and surfaces [?]. Whereas the above definition is quite general, in the current context we shall assume that the closed set $A$ is the bounding surface of a volumetric object. Hence, this set will have two complementary medial surfaces, one inside the volume and the other outside it. In most cases we shall be referring to the former, though the development applies to both.

Interest in the medial surface as a representation for a volumetric object stems from a number of useful properties: i) it is a thin set, i.e., it contains no interior points, ii) it is homotopic to the volume, iii) it is invariant under Euclidean transformations of the volume (rotations and translations), and iv) given the radius of the maximal inscribed sphere associated with each medial surface point, the volumetric object can be reconstructed exactly. Hence, it provides a compact representation while preserving the object’s genus and making certain properties explicit, such as its local width.

Approaches to computing skeletons and medial surfaces can be broadly organized into three classes. First, methods based on thinning attempt to realize Blum’s grassfire formulation [?] by peeling away layers from an object, while retaining special points [?]. It is possible to define erosion rules in a lattice such that the topology of the object is preserved. However, these methods are quite sensitive to Euclidean transformations of the data and typically fail to localize skeletal or medial surface points accurately. As a consequence, only a coarse approximation to the object is usually reconstructed [?].

Second, it has been shown that under appropriate smoothness conditions, the vertices of the Voronoi diagram of a set of boundary points converge to the exact skeleton as the sampling rate increases [?]. This property has been exploited to develop skeletonization algorithms in 2D [?], as well as extensions to 3D [?]. The dual of the Voronoi diagram, the Delaunay triangulation (or tetrahedralization in 3D), has also been used extensively. Here the skeleton is defined as the locus of centers of the circumscribed spheres of each tetrahedron [?]. Both types of methods preserve topology and accurately localize skeletal or medial surface points, provided that the boundary is sampled densely. Unfortunately, however, the techniques used to prune faces and edges which correspond to small perturbations of the boundary are typically based on heuristics. In practice, the results are not invariant under Euclidean transformations and the optimization step, particularly in 3D, can have a high computational complexity [?].

A third class of methods exploits the fact that the locus of skeletal or medial surface points coincides with the singularities of a Euclidean distance function to the boundary. These approaches attempt to detect local maxima of the distance function, or the corresponding discontinuities in its derivatives [?]. The numerical detection of these singularities is itself a non-trivial problem; whereas it may be possible to localize them, ensuring homotopy with the original object is difficult.

In recent work we observed that the grassfire flow leads to a Hamilton-Jacobi equation, which by nature is conservative in the smooth regime of its underlying phase space [?]. Hence, we suggested that a measurement of the net outward flux per unit volume of the gradient vector field of the Euclidean distance function could be used to associate locations where a conservation of energy principle was violated with medial surface points [?]. Unfortunately, in practice, the resulting medial surface was not guaranteed to preserve the topology of the object, since the flux computation was a purely local operation. The main contribution of the current paper is the combination of the flux measurement with a homotopy preserving thinning process applied in a cubic lattice. The method is robust and accurate, has low computational complexity and is now guaranteed to preserve topology. There are other promising recent approaches which combine aspects of thinning, Voronoi diagrams and distance functions [?]. In spirit, our method is closest to that of [?] but is grounded in principles from physics. We illustrate the algorithm with a number of examples of medial surfaces of synthetic objects and complex anatomical structures obtained from medical images.



2 Hamiltonian Medial Surfaces 



We shall first review the Hamilton-Jacobi formulation used to simulate the eikonal equation as well as detect singularities in [?]. Consider the grassfire flow

$$\frac{\partial S}{\partial t} = \mathcal{N} \tag{1}$$

acting on a 3D surface $S$, such that each point on its boundary is moving with unit speed in the direction of the inward normal $\mathcal{N}$. In physics, such equations are typically solved by looking at the evolution of the phase space of an equivalent Hamiltonian system. Since Hamiltonian systems are conservative, the locus of skeletal points (in 2D) or medial surface points (in 3D) coincides with locations where a conservation of energy principle is violated. This loss of energy can be used to formulate a natural criterion for detecting singularities of the distance function.

In more formal terms, let $D$ be the Euclidean distance function to the initial surface $S_0$. The magnitude of its gradient, $\|\nabla D\|$, is identical to 1 in its smooth regime. With $q = (x, y, z)$, $p = (D_x, D_y, D_z)$, associate to the surface $S \subset \mathbb{R}^3$, evolving according to Eq. 1, the surface $C \subset \mathbb{R}^6$ given by

$$C := \{(x, y, z, D_x, D_y, D_z) : (x, y, z) \in S,\ D_x^2 + D_y^2 + D_z^2 = 1,\ p \cdot \dot{q} = 1\}.$$

^ Malandain and Fernandez-Vidal use a heuristic estimation of the singularities of a distance function to obtain an initial skeleton or medial surface, and then perform a topological reconstruction to ensure homotopy with the original shape.

The Hamiltonian function obtained by applying a Legendre transformation to
the Lagrangian L = ||q|| is given by 



The associated Hamiltonian system is: 



P 



-^^( 0 , 0 , 0 ), 




~{D^,Dy,D^). 



(2) 



$C$ can be evolved under this system of equations, with $C(t) \subset \mathbb{R}^6$ denoting the
resulting (contact) surface. The projection of $C(t)$ onto $\mathbb{R}^3$ will then give the
parallel evolution of $S$ at time $t$, $S(t)$. Note that the interpretation of Eq. 2 is
quite intuitive: the gradient vector field p does not change with time, and points 
on the boundary of the surface move in the direction of the inward normal with 
unit velocity. 

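It is straightforward to show that all Hamiltonian systems are conservative
[?, p. 172]:

Theorem 1. The total energy $H(p, q)$ of the Hamiltonian system (2) remains
constant along trajectories of (2).

Proof. The total derivative of $H(p, q)$ along a trajectory $p(t), q(t)$ of (2) is
given by

$$\frac{dH}{dt} = \frac{\partial H}{\partial p} \cdot \dot{p} + \frac{\partial H}{\partial q} \cdot \dot{q} = -\frac{\partial H}{\partial p} \cdot \frac{\partial H}{\partial q} + \frac{\partial H}{\partial q} \cdot \frac{\partial H}{\partial p} = 0.$$

Thus $H(p, q)$ is constant along any trajectory of (2).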

The analysis carried out thus far applies under the assumption of a central
field of extremals such that trajectories of the Hamiltonian system do not intersect.
Conversely, when trajectories intersect, the conservation of energy principle
will be violated (energy will be absorbed). This loss of energy can be used to
formulate a robust and efficient algorithm for detecting singularities of the distance
function $D$, which correspond to medial surface points.

The key is to measure the flux of the vector field $\dot{q}$, which is analogous to
the flow of an incompressible fluid such as water. Note that for a volume with
an enclosed surface, an excess of outward or inward flow through the surface
indicates the presence of a source or a sink, respectively, in the volume. The
latter case is the one we are interested in, and the net outward flux is related to
the divergence of the vector field. More specifically, the divergence of a vector
field at a point, $\mathrm{div}(\dot{q})$, is defined as the net outward flux per unit volume, as
the volume about the point shrinks to zero:

$$\mathrm{div}(\dot{q}) = \lim_{\Delta v \to 0} \frac{\int_S \langle \dot{q}, \mathcal{N} \rangle \, ds}{\Delta v} \qquad (3)$$



Here $\Delta v$ is the volume, $S$ is its surface and $\mathcal{N}$ is the outward normal at each
point on its surface. This definition can be shown to be equivalent to the classical
definition of divergence as the sum of the partial derivatives with respect to each
of the vector field’s component directions:

$$\mathrm{div}(\dot{q}) = \frac{\partial \dot{q}_{x_1}}{\partial x_1} + \cdots + \frac{\partial \dot{q}_{x_n}}{\partial x_n} \qquad (4)$$



However, Eq. 4 cannot be used at points where the vector field is singular, and
hence is not differentiable. These are precisely the points we are interested in,
and Eq. 3 offers significant advantages for detecting medial surface points. In
particular, the numerator, which represents the net outward flux of the vector
field through the surface which bounds the volume, is an index computation on
the vector field. It is not surprising that this is numerically much more stable than
the estimation of derivatives in the vicinity of singularities. Via the divergence
theorem,

$$\int_v \mathrm{div}(\dot{q}) \, dv = \int_S \langle \dot{q}, \mathcal{N} \rangle \, ds. \qquad (5)$$

Hence, the net outward flux through the surface which bounds a finite volume is 
just the volume integral of the divergence of the vector field within that volume. 
Locations where the flux is negative, and hence energy is lost, correspond to 
sinks or medial surface points. 
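
As a minimal sketch of this idea, the snippet below sums $\langle \mathcal{N}_i, \nabla D(P_i) \rangle$ over the 26 lattice neighbors of a point, which is one possible discretization of the surface integral in Eq. 5. The test object (an infinite slab of width 10), the function names `slab_gradient` and `total_outward_flux`, and the convention of returning a zero gradient exactly on the mid-plane are assumptions made for illustration only; they are not taken from the paper.

```python
import math
from itertools import product

# The 26 lattice directions from a point to its neighbors.
OFFSETS = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]

def slab_gradient(x, width=10.0):
    """Gradient of D = min(x, width - x) inside the slab 0 < x < width.

    Hypothetical test object: the gradient is (+1, 0, 0) left of the
    mid-plane, (-1, 0, 0) right of it, and is singular on the mid-plane
    itself (returned here as the zero vector).
    """
    mid = width / 2.0
    if x < mid:
        return (1.0, 0.0, 0.0)
    if x > mid:
        return (-1.0, 0.0, 0.0)
    return (0.0, 0.0, 0.0)

def total_outward_flux(point, gradient):
    """Sum of <N_i, grad D(P_i)> over the 26 neighbors P_i of `point`."""
    flux = 0.0
    for off in OFFSETS:
        norm = math.sqrt(sum(c * c for c in off))
        normal = tuple(c / norm for c in off)            # outward normal N_i
        neighbor = tuple(p + c for p, c in zip(point, off))
        grad = gradient(neighbor[0])                     # grad D at P_i
        flux += sum(n * g for n, g in zip(normal, grad))
    return flux

flux_on_medial = total_outward_flux((5, 0, 0), slab_gradient)   # on the mid-plane
flux_off_medial = total_outward_flux((2, 0, 0), slab_gradient)  # smooth region
```

On the mid-plane the gradient vectors on both sides point inward, so the sum is strongly negative (a sink); in the smooth region the contributions cancel by symmetry.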

We now have a robust method for localizing medial surface points by discretizing
Eq. 5, and thresholding to select points with negative total outward flux.
However, since the computation is local, global properties, such as the preservation
of the object’s topology, are not ensured. In our earlier work we have
observed that the method gives accurate medial surfaces, but that as the threshold
is varied new holes or cavities may be introduced and the medial surface
may get disconnected [?]. The remedy, as we shall now show, is to introduce
additional criteria along the lines of those incorporated in [?] to ensure that
the medial surface is homotopic to the original object.



3 Homotopy Preserving Medial Surfaces 

Our goal is to combine the divergence computation with a thinning process 
acting in the cubic lattice, such that as many points as possible are removed 
without altering the object’s topology. A point is called a simple point if its 
removal does not change the topology of the object. Hence in 3D, its removal 
must not disconnect the object, create a hole, or create a cavity. We shall adopt 
a formal definition of a simple point introduced by Malandain et al. [?]. First
we review a few basic concepts in digital topology. 



3.1 Digital Topology 

In 3D digital topology, the input is a binary (foreground and background) image 
stored in a 3D array. We shall consider only cubic lattices, where a point is 
viewed as a unit cube with 6 faces, 12 edges and 8 vertices. For each point, three 
types of neighbors are defined:

- 6-neighbors: two points are 6-neighbors if they share a face,

- 18-neighbors: two points are 18-neighbors if they share a face or an edge,
and

- 26-neighbors: two points are 26-neighbors if they share a face, an edge or a
vertex.

Fig. 1. 6-neighborhoods, 18-neighborhoods and 26-neighborhoods in a cubic lattice.
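
The three neighbor types can be enumerated directly from integer offsets. The following is a small sketch, assuming points are integer triples; the helper names `neighbor_offsets` and `are_n_neighbors` are hypothetical. A neighbor sharing a face differs in one coordinate, one sharing an edge in two, and one sharing only a vertex in three.

```python
from itertools import product

def neighbor_offsets(n):
    """Offsets of the n-neighbors (n in {6, 18, 26}) of a lattice point."""
    max_nonzero = {6: 1, 18: 2, 26: 3}[n]   # face / face+edge / face+edge+vertex
    return [o for o in product((-1, 0, 1), repeat=3)
            if 0 < sum(abs(c) for c in o) <= max_nonzero]

def are_n_neighbors(p, q, n):
    """True if lattice points p and q are n-neighbors."""
    diff = tuple(a - b for a, b in zip(p, q))
    return diff in neighbor_offsets(n)
```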

The above definitions induce three types of connectivity, denoted n-connectivity,
where $n \in \{6, 18, 26\}$, as well as three different n-neighborhoods for $x$, called
$N_n(x)$ (see Figure 1). An n-neighborhood without its central point is defined as
$N_n^* = N_n(x) \setminus \{x\}$. A few more definitions are needed to characterize simple
points:

- An object $A$ is n-adjacent to an object $B$ if there exist two points $x \in A$
and $y \in B$ such that $x$ is an n-neighbor of $y$.

- An n-path from $x_1$ to $x_k$ is a sequence of points $x_1, x_2, \ldots, x_k$, such that for all
$x_i$, $1 < i \le k$, $x_{i-1}$ is n-adjacent to $x_i$.

- An object represented by a set of points $O$ is n-connected if for every pair
of points $(x_i, x_j) \in O \times O$, there is an n-path from $x_i$ to $x_j$.

Based on these definitions, Malandain et al. provide a topological classification
of a point $x$ in a cubic lattice by computing two numbers [11 7]: i) $C^*$: the
number of 26-connected components 26-adjacent to $x$ in $O \cap N_{26}^*$, and ii) $\bar{C}$: the
number of 6-connected components 6-adjacent to $x$ in $\bar{O} \cap N_{18}$. An important
result with respect to our goal of thinning is that if $C^* = 1$ and $\bar{C} = 1$, the point
is simple, and hence removable. When ensuring homotopy is the only concern,
simple points can be removed sequentially until no more simple points are left.
The resulting set will be thin and homotopic to the shape. However, the relationship
to the medial surface will be uncertain since the locus of surviving points
will depend entirely on the order in which the simple points have been removed.
In the current context, we have derived a natural criterion for ordering the
thinning, based on the divergence of the gradient vector field of the Euclidean
distance function.
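
The two numbers can be computed by a flood fill over the 3x3x3 neighborhood. The sketch below is one way to do this under stated assumptions: the neighborhood of $x$ is passed as the set of offsets that belong to the object, and the names `components` and `is_simple` are hypothetical. It is not the authors' implementation.

```python
from itertools import product

ALL = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]
N6 = {o for o in ALL if sum(map(abs, o)) == 1}
N18 = {o for o in ALL if sum(map(abs, o)) <= 2}
N26 = set(ALL)

def components(points, adjacency):
    """Split `points` into connected components under the given adjacency."""
    remaining, comps = set(points), []
    while remaining:
        stack = [remaining.pop()]
        comp = set(stack)
        while stack:
            a = stack.pop()
            linked = {b for b in remaining
                      if tuple(x - y for x, y in zip(a, b)) in adjacency}
            remaining -= linked
            comp |= linked
            stack.extend(linked)
        comps.append(comp)
    return comps

def is_simple(foreground):
    """`foreground`: offsets of the 26-neighbors of x that belong to O."""
    foreground = set(foreground)
    # C*: 26-connected components of the object in N26* (all are 26-adjacent to x).
    c_star = len(components(foreground & N26, N26))
    # C-bar: 6-connected background components in N18 that touch a 6-neighbor of x.
    background18 = N18 - foreground
    c_bar = sum(1 for comp in components(background18, N6) if comp & N6)
    return c_star == 1 and c_bar == 1

interior = set(ALL)                                  # fully surrounded point
flat_face = {o for o in ALL if o[2] <= 0}            # point on a flat face
line_middle = {(-1, 0, 0), (1, 0, 0)}                # middle of a thin line
line_end = {(-1, 0, 0)}                              # end of a thin line
```

Note that the end of a thin line is simple, which is why a separate endpoint criterion is needed if curves and surface rims are to survive thinning.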




Divergence-Based Medial Surfaces 609 




Fig. 2. An endpoint is defined as the end of a 6-connected curve or the corner or rim 
of a 6-connected surface in 3D. For each configuration, there exists at least one plane 
in which the point has at least three background 6-neighbors. 



3.2 Divergence-Ordered Thinning

Recall from Section 2 that a conservation of energy principle is violated at medial
surface points. The total outward flux of the gradient vector field of the
Euclidean distance function is negative at such points, since they correspond to
sinks.^2 More importantly, the magnitude of the total outward flux is proportional
to the amount of energy absorbed, and hence provides a natural measure
of the “strength” of a medial surface point, which we shall use for numerical
computations. The essential idea is to order the thinning such that the weakest
points are removed first, and to stop the process when all surviving points are
not simple, or have a total outward flux below some chosen (negative) value, or
both. This will accurately localize the medial surface, and also ensure homotopy
with the original object. Unfortunately the result is not guaranteed to be a thin
set, i.e., one without an interior.

^2 Conversely, medial surface points of the background correspond to sources, with
positive total outward flux.

One way of satisfying this last constraint is to define an appropriate notion
of an endpoint in a cubic lattice. Such a point would correspond to the endpoint
of a curve, or a point on the rim of a surface, in 3D. The thinning process would
proceed as before, but the threshold criterion for removal would be applied only
to endpoints. Hence, all surviving points which are not endpoints would not be
simple, and the result would be a thin set.

To facilitate this task, we shall restrict our definition of an endpoint to a
6-connected neighborhood. In other words, an endpoint is either the end of a
6-connected curve, or a corner or point on the rim of a 6-connected surface.
It is straightforward to enumerate the possible 6-connected neighborhoods of
the endpoint, and to show that they fall into one of three configurations (see
Figure 2). Notice that for each configuration, there exists at least one plane
in which the point has at least three background 6-neighbors. This gives us
a necessary condition to determine if a point is an endpoint according to our
definition. Note that before performing this check, one must also verify that the
point is simple.
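
The necessary condition can be checked plane by plane. The sketch below assumes that "plane" means one of the three coordinate planes through the point and that the caller has already verified the point is simple; the name `may_be_endpoint` and the input convention (the set of 6-neighbor offsets that belong to the object) are hypothetical.

```python
# For each coordinate plane through the point, its four in-plane 6-neighbors.
PLANES = {
    "xy": [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)],
    "xz": [(1, 0, 0), (-1, 0, 0), (0, 0, 1), (0, 0, -1)],
    "yz": [(0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)],
}

def may_be_endpoint(foreground6):
    """Necessary condition only: some plane has >= 3 background 6-neighbors.

    `foreground6` is the set of 6-neighbor offsets that belong to the object.
    """
    return any(sum(1 for o in plane if o not in foreground6) >= 3
               for plane in PLANES.values())

curve_end = {(-1, 0, 0)}                                  # end of a curve along x
sheet_interior = {(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)}
sheet_rim = {(-1, 0, 0), (0, 1, 0), (0, -1, 0)}           # rim of an xy sheet
```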



3.3 The Algorithm 

The thinning process can be made very efficient by observing that a point which 
does not have at least one background point as an immediate neighbor cannot be 
removed, since this would create a hole or a cavity. Therefore, the only potentially 
removable points are on the border of the object. Once a border point is removed, 
only its neighbors may become removable. This suggests the implementation of 
the thinning process using a heap. A full description of the procedure can be
found in Algorithm 1.



Algorithm 1 The divergence-ordered thinning algorithm.

Part I: Total Outward Flux
    Compute the distance transform of the object, $D$.
    Compute the gradient vector field $\nabla D$.
    Compute the net outward flux of $\nabla D$ using Eq. 5:
    For each point P in the interior of the object
        Flux(P) = $\sum_i \langle N_i, \nabla D(P_i) \rangle$,
        where $P_i$ is a 26-neighbor of P and $N_i$ is the outward
        normal at $P_i$ of the unit sphere in 3D, centered at P.

Part II: Homotopy Preserving Thinning
    For each point P on the boundary of the object
        if (P is simple)
            insert(P, Heap) with Flux(P) as the sorting key for insertion
    While (Heap.size > 0)
        P = HeapExtractMax(Heap)
        if (P is simple)
            if (P is not an endpoint) or (Flux(P) > Thresh)
                Remove P
                for all neighbors Q of P
                    if (Q is simple)
                        insert(Q, Heap)
            else mark P as a skeletal (end) point
            end { if }
        end { if }
    end { while }

We now analyze the complexity of the algorithm. The computation of the
distance transform, the gradient vector field and the total outward flux are all
$O(n)$ operations. Here $n$ is the total number of points in the 3D array. The
implementation of the thinning is more subtle. We claim an $O(k \log k)$ worst
case complexity, where $k$ is the number of points in the volumetric object. The
explanation is as follows. At first store only the points that are on the outer
layer of the object in a heap, using the total outward flux as the sorting key
for insertion. The extraction of the maximum from the heap will provide the
best candidate for removal. If this point is removable, then delete it from the
object and add its simple (potentially removable) neighbors to the heap. A point
can only be inserted a constant number of times (at most 26 times for a
26-neighborhood), and insertion in a heap, as well as the extraction of the maximum,
are both $O(\log l)$ operations, where $l$ is the number of elements in the heap.
There cannot be more than $k$ elements in the heap, because we only have a total
of $k$ points in the volume. The worst case complexity for thinning is therefore
$O(k \log k)$. Hence, the complexity of the algorithm is $O(n) + O(k \log k)$.
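
Part II of Algorithm 1 can be sketched with Python's `heapq`, which is a min-heap, so the negated flux is used as the key to extract the maximum. This is an illustrative sketch under several assumptions, not the authors' code: the object is a set of integer triples, the flux is supplied as a dictionary, the endpoint test is the coordinate-plane check described in Section 3.2, and the flux values in the usage example are a crude stand-in (the negated layer depth of a cube) rather than a real divergence computation. All function names are hypothetical.

```python
import heapq
from itertools import product

ALL = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]
N6 = {o for o in ALL if sum(map(abs, o)) == 1}
N18 = {o for o in ALL if sum(map(abs, o)) <= 2}
N26 = set(ALL)
# For each coordinate plane through a point, its four in-plane 6-neighbors.
PLANES = [[o for o in N6 if o[axis] == 0] for axis in range(3)]

def components(points, adjacency):
    """Split `points` into connected components under the given adjacency."""
    remaining, comps = set(points), []
    while remaining:
        stack = [remaining.pop()]
        comp = set(stack)
        while stack:
            a = stack.pop()
            linked = {b for b in remaining
                      if tuple(x - y for x, y in zip(a, b)) in adjacency}
            remaining -= linked
            comp |= linked
            stack.extend(linked)
        comps.append(comp)
    return comps

def local_foreground(obj, p):
    """Offsets of the 26-neighbors of p that belong to the object."""
    return {o for o in ALL if tuple(a + b for a, b in zip(p, o)) in obj}

def is_simple(obj, p):
    fg = local_foreground(obj, p)
    c_star = len(components(fg, N26))
    c_bar = sum(1 for c in components(N18 - fg, N6) if c & N6)
    return c_star == 1 and c_bar == 1

def is_endpoint(obj, p):
    fg = local_foreground(obj, p)
    return any(sum(o not in fg for o in plane) >= 3 for plane in PLANES)

def thin(obj, flux, thresh):
    """Remove simple points in order of decreasing flux (weakest first)."""
    obj, heap = set(obj), []
    for p in obj:                                    # border points only
        if len(local_foreground(obj, p)) < 26 and is_simple(obj, p):
            heapq.heappush(heap, (-flux[p], p))
    while heap:
        _, p = heapq.heappop(heap)
        if p not in obj or not is_simple(obj, p):    # stale or no longer simple
            continue
        if not is_endpoint(obj, p) or flux[p] > thresh:
            obj.remove(p)
            for o in ALL:
                q = tuple(a + b for a, b in zip(p, o))
                if q in obj and is_simple(obj, q):
                    heapq.heappush(heap, (-flux[q], q))
        # otherwise p is kept as a skeletal (end) point
    return obj

# Usage on a hypothetical 5x5x5 cube with a stand-in flux (negated layer depth).
cube = set(product(range(5), repeat=3))
proxy_flux = {p: -float(min(min(c, 4 - c) for c in p) + 1) for p in cube}
result = thin(cube, proxy_flux, thresh=-1.5)
```

Because a point can be pushed more than once, stale heap entries are skipped by re-checking membership and simplicity when a point is popped, rather than by updating the heap in place.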



4 Examples 




Fig. 3. First Column: Three views of a cube. Second Column: The corresponding
medial surfaces computed using the algorithm of [?]. Third Column: The object
reconstructed from the medial surfaces in the previous column. Fourth Column: The 
corresponding divergence-based medial surfaces. Fifth Column: The object recon- 
structed from the medial surfaces in the previous column. 




612 S. Bouix and K. Siddiqi 



We illustrate the algorithm with both synthetic data and volumes segmented
from MR and MRA images. In these simulations we have used the D-Euclidean
distance transform, which provides a close approximation to the true Euclidean
distance function [?]. The only free parameter is the choice of the divergence
value below which the removal of endpoints is blocked. For all examples, this
was selected so that approximately 25% of the points within the volume had a
lower divergence value.

Figures 3 and 4 compare our approach with the parallel thinning method
introduced by Manzanera et al. [?]. The results reveal that both frameworks
are robust, and yield structures that are homotopic to the underlying object.
However, note that the latter method yields only a subset of the “true” medial
surface for these data sets, and hence only a coarse approximation to the object
is possible. In contrast, a near perfect reconstruction is possible from the
divergence-based medial surfaces.

Fig. 4. First Column: Three views of a cylinder. Second Column: The corresponding
medial surfaces computed using the algorithm of [?]. Third Column: The object
reconstructed from the medial surfaces in the previous column. Fourth Column: The
corresponding divergence-based medial surfaces. Fifth Column: The object
reconstructed from the medial surfaces in the previous column.



Recall from Section 3.3 that the computation of the distance transform, its
gradient vector field and the total outward flux are all $O(n)$ operations. Hence,
one may be tempted to simply threshold the divergence map below a certain
(negative) value to obtain a medial surface extremely efficiently, as in [?].
Unfortunately, due to discretization, the result will not always be satisfactory. On
the one hand, if the threshold is too high, slight perturbations of the boundary
will be represented, and the resulting structure will not be a thin set. On the
other hand, lowering the threshold can provide a thin set, but at the cost of
altering the object’s topology. This is illustrated in Figure 5 (second column)
where the medial surfaces corresponding to the views in the first column are
accurate and thin, but have holes. This motivates the need for the topological
constraints along with the characterization of endpoints discussed in Section 3.
Observe that with the same threshold as before, the divergence-based thinning
algorithm now yields a thin structure which preserves topology, Figure 5 (third
column). The ventricles reconstructed from the medial surfaces in the third
column are shown in the fourth column.

Fig. 5. First Column: Four views of the ventricles of a brain, segmented from
volumetric MR data using an active surface. Second Column: The corresponding medial
surfaces obtained by thresholding the divergence map. Third Column: The
divergence-based medial surfaces obtained using the same threshold, but with the incorporation
of homotopy preserving thinning. Fourth Column: The ventricles reconstructed from
the divergence-based medial surfaces in the previous column.

Next, we illustrate the robustness of the approach on a (partial) data set
of blood vessels obtained from an MRA image of the brain, in Figure 6. The
blood vessels have complex topology with loops (due to pathologies), and are
already quite thin in several places. The bottom row illustrates the accuracy of
the method, where the medial surfaces are shown embedded within the original
data set. Generically these structures are thin sheets which approach 3D curves
when the blood vessels become perfectly cylindrical. In a number of medical
applications where the objects are tubular structures, an explicit reduction of
the medial surface to a set of 3D curves is of interest [?]. There is a
straightforward modification of our framework which allows this, provided that
certain special points on the medial surface have been identified. The essential
idea is to preserve such special points, but to remove all other simple points in
a sequence ordered by their divergence values. This is illustrated for a portion
of the vessel data in Figure 7, where the endpoints of 3 branches were selected
as special points. Observe that the result is now composed of three 1 voxel wide
26-connected 3D digital curves.

Fig. 6. Top Row: Blood vessels segmented from volumetric MRA data, with magnified
parts shown in the middle and right columns. Middle Row: The corresponding
divergence-based medial surfaces. Third Row: The divergence-based medial surfaces
(solid) are shown within the vessel surfaces (transparent).

As a final example, Figure 8 illustrates the medial surface of the sulci of a
brain, where we have shown an X, Y and Z slice through the volume. Observe
that the medial surface is well localized, and captures the complex topology of
the object’s shape.

Fig. 7. Left Column: Blood vessels segmented from volumetric MRA data, with a
magnified portion shown in the second row. Middle Column: The divergence-based
3D curves. Right Column: The divergence-based 3D curves are shown embedded
within the vessel data.



5 Conclusions 



We have introduced a novel algorithm for computing medial surfaces which is
robust and accurate, computationally efficient, invariant to Euclidean
transformations and homotopy preserving. The essential idea is to combine a divergence
computation on the gradient vector field of the Euclidean distance function to
the object’s boundary with a thinning process that preserves topology. The
characterization of simple (or removable) points is adopted from [11 7], but we have
also introduced a notion of an endpoint of a 6-connected structure, in order that
the algorithm may converge to a thin set. We have illustrated the advantages of
the approach on synthetic and real binary volumes of varying complexity.

We note that in related work, Malandain and Fernandez-Vidal obtain two
sets based on thresholding a function of two measures, $\phi$ and $d$, to characterize
the singularities of the Euclidean distance function [?]. The first set preserves
topology but captures many unwanted details and is not thin, while the second
set provides a better approximation to the skeleton or medial surface, but often
alters the object’s topology. The two sets are combined using a topological
reconstruction process. Whereas empirical results have been good, the choice of
appropriate thresholds is context dependent. More seriously, $\phi$ and $d$, as well
as their combination, are all based on heuristics.

Fig. 8. Top Row: Medial surfaces of the sulci of a brain, segmented from an MR
image. The three columns represent X, Y and Z slices through the volume. The cross
section through the medial surface in each slice is shown in black, and the object is
shown in grey. Bottom Row: A zoom-in on a selected region of the corresponding
slice in the top row, to show detail.



In contrast, our method is rooted in a physics-based analysis of the gradient
vector field of the Euclidean distance function, which shows that a conservation
of energy principle is violated at medial surface points. This justifies the use of
the divergence theorem to compute the total outward flux of the vector field, and
to locate points where energy is absorbed. It should be clear that whereas we have
focussed on the interior of an object, the medial surface of the background can be
similarly obtained by locating points that act as sources, and have positive total
outward flux. Furthermore, both medial surfaces can be located with sub-voxel
accuracy by using the local gradient vector field to shift the final set of digital
points. In related work we have demonstrated this idea for 2D shapes, where
a similar framework was used to compute sub-pixel 2D skeletons and skeletal
graphs [?].

In future work we plan to incorporate the topological classification of [?] to
parse the medial surface and obtain a more abstract representation of it, e.g.,
as a graph. We shall also explore the possibility of finding necessary as well as
sufficient conditions for defining endpoints in a cubic lattice.

Abstract. Consider the situation of a monocular image sequence with
known ego-motion observing a 3D point moving simultaneously but along
a path of up to second order, i.e. it can trace a line in 3D or a conic shaped
path. We wish to reconstruct the 3D path from the projection of the
tangent to the path at each time instance. This problem is analogous to the
“trajectory triangulation” of lines and conic sections recently introduced
in [?], but instead of observing a point projection we observe a tangent
projection and thus obtain a far simpler solution to the problem.

We show that the 3D path can be solved in a natural manner, and linearly,
using degenerate quadric envelopes, specifically the disk quadric.
Our approach works seamlessly with both linear and second order paths,
thus there is no need to know in advance the shape of the path as
with the previous approaches for which lines and conics were treated as
distinct. Our approach is linear in both straight line and conic paths,
unlike the non-linear solution associated with point trajectory [?].

We provide experiments that show that our method behaves extremely
well on a wide variety of scenarios, including those with multiple moving
objects along lines and conic shaped paths.



1 Introduction 

There has been a recent drive towards extending the envelope of multi-view 3D
reconstruction and Image-based Rendering beyond the static scene assumption
in the sense of considering “dynamic” situations in which a point to be
reconstructed, or view-morphed, is moving simultaneously while the camera is in
motion [?]. In these situations only line-of-sight measurements are available,
that is, optical rays from distinct camera positions do not intersect (triangulate)
at a point in space (as in the conventional multi-view reconstruction paradigm),
thus new techniques need to be introduced in order to handle these cases. This
paradigm was named “trajectory triangulation” by [?].

In this paper we consider a related problem defined as follows. Consider a
point $P$ moving in space along a path $\gamma$ which could either be a straight line or
a conic in 3D, i.e., a path up to second order in space coordinates. The path $\gamma$
is observed by a generally moving camera whose ego-motion (camera projection
matrices) is known. The observation takes the form of the projections of tangents
to the curve $\gamma$. In other words, in each image we see one (or more) tangents of $\gamma$
as a line in the image plane (we refer to this as “tangent-of-sight” measurement)
and our task is to reconstruct $\gamma$ from the image measurements across multiple
(at least 2) views. See Fig. 1 for an illustration of the problem.




Fig. 1. The conic/line path $\gamma$ is observed by a generally moving camera where each
image views the projection of a local tangent to the curve. The task is to reconstruct
$\gamma$ from such “tangent-of-sight” measurements across a sequence of views.



Note that in our problem an observation is a line segment (projection of
the tangent to $\gamma$ in space), whereas in [?] the observation is an image point
corresponding to a moving point $P$ along $\gamma$. This change in problem definition
makes a significant difference in the kind of tools one can use for the solution.
In the case of a point observation it was shown that when $\gamma$ is a straight line the
solution (Plücker coordinates of the line) is linear in the image observations and
5 views are necessary, whereas when $\gamma$ is a conic 9 views are necessary and
the solution is non-linear. Moreover, the solution assuming $\gamma$ is a conic would
break down if $\gamma$ is a straight line (and will suffer from numerical instability in
case $\gamma$ is close to a straight line).

We will show that in the case of tangent-of-sight observations a natural tool
for approaching the reconstruction of $\gamma$ is the degenerate quadric envelope, in
particular the disk quadric. Quadric envelopes were introduced in the computer
vision literature in the context of camera self-calibration (the “absolute quadric”
[?]), but here we make a different use of them. In our approach, each observation
contributes a linear equation for a disk quadric, thus in the case each image
contains only one observation one would need at least 9 views (or 8 views for a
4-fold ambiguity). Given the disk quadric it becomes a simple matter to extract
from it the parameters of $\gamma$ (3 parameters of the plane and 5 parameters of the
conic on the plane). Moreover, the technique would work seamlessly without
changes when $\gamma$ is a straight line as well.

3D Reconstruction from Tangent-of-Sight Measurements 623

On a practical level, measurements of line segments are quite naturally ob- 
tained from images. Moreover, the occluding contour of a moving object would 
provide the tangent-of-sight measurement necessary for our computations. Thus, 
the introduction of tangent-of-sight observations in the realm of “trajectory
triangulation” is both practical and feasible.

2 Background: Cameras and Quadric Loci and Envelopes 

We will be working with the projective 3D space and the projective plane. In this 
section we will describe the basic elements we will be working with: (i) camera 
projection matrices, and (ii) Quadric envelopes and the disk quadric. 

A point in the projective plane is defined by three numbers, not all zero, 
that form a coordinate vector defined up to a scale factor. The dual projective 
plane represents the space of lines which are also defined by a triplet of numbers. 
A point in the projective space is defined by four numbers, not all zero, that 
form a coordinate vector defined up to a scale factor. The dual projective space 
$\mathcal{P}^{3*}$ represents the space of planes, which are also defined by a quadruple of
numbers.

The projection from 3D space to 2D space is determined by a 3 × 4 matrix $M$.
If $P, p$ are corresponding 3D and 2D points, then $p \cong MP$, where $\cong$ denotes
equality up to scale. If $\lambda$ is a line in 2D, then $\lambda^T M$ is the plane passing through
the line $\lambda$ and the projection center of the camera. The plane is referred to as
the “visual plane” of $\lambda$.
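
As a small numerical illustration of the visual plane, the sketch below back-projects an image line through a camera and checks that the resulting plane contains both the camera center and a 3D point that projects onto the line. The camera matrix, the line and the point are made-up values chosen for the example, and the helper names are hypothetical.

```python
def visual_plane(M, line):
    """Plane U = M^T * line for a 3x4 camera M and an image line (a, b, c)."""
    return tuple(sum(M[r][c] * line[r] for r in range(3)) for c in range(4))

def project(M, P):
    """Image point p = M * P for a homogeneous 3D point P."""
    return tuple(sum(M[r][c] * P[c] for c in range(4)) for r in range(3))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical camera [I | t] with t = (1, 2, 3); its center is (-1, -2, -3).
M = [[1, 0, 0, 1],
     [0, 1, 0, 2],
     [0, 0, 1, 3]]
center = (-1, -2, -3, 1)
line = (1, 0, -2)          # the image line x - 2 = 0
P = (7, 0, 1, 1)           # a 3D point whose projection lies on that line

U = visual_plane(M, line)  # the visual plane of the line
```

Since $U^T P = \lambda^T (MP)$, a 3D point lies on the visual plane exactly when its projection lies on the image line.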

A quadric locus is a second order polynomial of the 3D projective coordinates
representing points on a quadric surface. The points $P \in \mathcal{P}^3$ satisfying the
equation $P^T Q P = 0$, where $Q$ is a symmetric 4 × 4 matrix, define a quadric
locus. If $P$ is on the quadric surface, then $QP$ is the tangent plane at that point.

A quadric envelope is a second order polynomial of the 3D coordinates of the
dual space, the space of planes. The planes $U \in \mathcal{P}^{3*}$ that satisfy the equation
$U^T Q^* U = 0$, where $Q^*$ is a symmetric matrix, are tangents of a quadric surface.
If $U$ is a tangent to the surface (belongs to the envelope), then $Q^* U$ is the point
on the surface defined by the intersection of the tangent plane and the surface.
If $Q$ is full rank, then by the principle of point-plane duality we have

$$0 = (QP)^T Q^* (QP) = P^T (Q^T Q^* Q) P \implies Q^* = Q^{-1}.$$

In other words, one can move from point-equation to plane-equation of a quadric
simply by taking the cofactors matrix, i.e., $Q^*$ can be described by the cofactors
of $Q$ (and vice versa).
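
The duality can be checked on the unit sphere, a proper quadric chosen here purely as an example because its matrix is its own inverse. In the sketch below, the helper names and the sample point are assumptions made for illustration.

```python
def mat_vec(Q, v):
    return tuple(sum(Q[r][c] * v[c] for c in range(4)) for r in range(4))

def quad_form(Q, v):
    return sum(v[r] * Q[r][c] * v[c] for r in range(4) for c in range(4))

# Unit sphere x^2 + y^2 + z^2 - w^2 = 0; this matrix is its own inverse,
# so the envelope matrix Q* = Q^{-1} has the same entries.
Q = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, -1]]
Q_star = Q

P = (3, 4, 0, 5)               # homogeneous point (0.6, 0.8, 0) on the sphere
U = mat_vec(Q, P)              # tangent plane at P
contact = mat_vec(Q_star, U)   # point of contact recovered from the plane
```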

A full rank quadric (locus and envelope) is called a proper quadric. A rank-3
quadric is called a quadric cone. Let $QX = 0$, then clearly the point $X$ belongs
to the surface because $X^T Q X = 0$ as well. Let $P$ be any other point on the
surface, i.e., $P^T Q P = 0$. Then, the entire line $\alpha P + \beta X$ is also on the surface:

$$(\alpha P + \beta X)^T Q (\alpha P + \beta X) = 0.$$

Therefore, the quadric cone is generated by lines through the point (apex) $X$.
The co-factor matrix $Q^*$ is a rank-1 matrix as follows. All the planes through
$X$ are tangent planes, i.e., $U^T Q^* U = 0$ and $U^T X = 0$, i.e., the rows of $Q^*$ are
scaled versions of $X$, and $U^T Q^* U = 0$ is the plane equation of $X$ taken twice
(repeated plane).

A rank-3 quadric envelope is called a disk quadric. Let $Q^* Y = 0$, then $Y$
is tangent to the surface because $Y^T Q^* Y = 0$ as well. If $U$ belongs to the
envelope, i.e., $U^T Q^* U = 0$, then the pencil of planes $\alpha U + \beta Y$ belongs to the
envelope as well. Therefore, the disk quadric is a “disk” of coplanar points where
the boundary of the disk is a conic section, that is, the envelope includes all the
planes through all the tangents to the boundary conic, i.e., $\infty^1$ pencils of $\infty^1$
planes (see Fig. 2). The matrix of co-factors is a rank-1 matrix whose rows are
scaled versions of $Y$, the plane of the disk.
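
A concrete disk quadric makes these properties easy to verify. The sketch below uses the unit circle in the plane $z = 0$ as a hypothetical example, for which the envelope matrix is $\mathrm{diag}(1, 1, 0, -1)$, and checks that every plane through a tangent line of the circle belongs to the envelope. The helper names and sample planes are assumptions for illustration.

```python
def mat_vec(Q, v):
    return tuple(sum(Q[r][c] * v[c] for c in range(4)) for r in range(4))

def quad_form(Q, v):
    return sum(v[r] * Q[r][c] * v[c] for r in range(4) for c in range(4))

# Disk quadric of the unit circle x^2 + y^2 = 1 lying in the plane z = 0:
# a plane (a, b, c, d) belongs to the envelope when a^2 + b^2 - d^2 = 0.
Q_star = [[1, 0, 0, 0],
          [0, 1, 0, 0],
          [0, 0, 0, 0],
          [0, 0, 0, -1]]

Y = (0, 0, 1, 0)                       # the plane z = 0 that carries the disk

def pencil(t):
    """Planes through the tangent line x = 1, z = 0, tilted by t."""
    return (1, 0, t, -1)

tilts = [0, 1, -3, 10]
in_envelope = [quad_form(Q_star, pencil(t)) == 0 for t in tilts]
contact_points = {mat_vec(Q_star, pencil(t)) for t in tilts}
secant_plane = (1, 0, 0, -2)           # the plane x = 2 misses the circle
```

All planes of the pencil give the same contact point $(1, 0, 0)$, which is the point where the shared tangent line touches the circle.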




Fig. 2. A disk quadric is a “disk” of coplanar points where the boundary of the disk
is a conic section, that is, the envelope includes all the planes through all the tangents
to the boundary conic, i.e., $\infty^1$ pencils of $\infty^1$ planes.



It is important to note that the application of the principle of duality (matrix 
of cofactors) does not hold for rank-deficient quadrics. The disk quadric is the 
dual of the proper cone, yet the transformation between point-equation and 
plane-equation cannot be achieved solely through the matrix of cofactors. We 
will return to this issue later in the paper. 

Finally, a rank-2 quadric locus describes a pair of distinct planes, and a rank- 
2 quadric envelope describes a pair of distinct points. Table 1 summarizes the
rank classifications of quadric loci and quadric envelopes. 



3 Reconstruction of a Conic Path from Tangents-of-Sight 

We wish to recover the path of an object moving in space from a sequence of
images without prior knowledge of the shape of the path (line in 3D or planar
conic in 3D). We will present a method for linearly recovering the path from 9
views, given their 9 projection matrices and a tangent to the path of motion in
each view. The output of the algorithm in the case of a planar conic will be the
3 parameters of the plane and the 5 parameters of the conic on that plane. In
case of a line the output will be the Plücker coordinates of the line in 3D. Our
method is based on the fact that both a planar conic in 3D and a line in 3D
have a representation as a degenerate quadric envelope.

Table 1. Classification of quadrics according to rank

ρ    Quadric Locus            Quadric Envelope
4    Proper Quadric           Proper Quadric Envelope
3    Proper Cone              Disk Quadric
2    Pair of planes (line)    Pair of points (line)
1    Repeated Plane           Repeated Point

Our problem can be stated as follows. We are given tangents of a moving 
point (or equivalently we are observing a moving tangent line tracing a conic 
envelope in space) across a number of views seen from a monocular sequence 
with known ego-motion, i.e., the camera projection matrices are assumed to be 
known. We wish to recover the conic path and to reconstruct the 3D position of 
the moving point at each time instance. See Fig. E 

Given the background on quadric envelopes, it becomes clear that the conic 
trajectory is part of a disk quadric. The observation we obtain from image 
space at each time instance is the projection of a pencil of planes (a tangent 
line to the boundary of the disk quadric). 

Let $l_i$ be the tangent line measured in view $i$, and let $M_i$ be the camera 
projection matrix at time $i$. Then $l_i^\top M_i$ is the visual plane which is tangent to 
the disk quadric. We have therefore the linear set of equations: 

$$ l_i^\top M_i \, Q^* \, M_i^\top l_i = 0 $$

which provides a unique solution for Q* using 9 views, or a 4-fold ambiguity 
using 8 views (because we know that the determinant of Q* must vanish). It is 
a reasonable assumption that if the moving camera is viewing a single moving 
object we would be able to extract one tangent to the path of motion per image, 
tracking some visible feature on the object, but we are not restricted to using 
one tangent in each view. Indeed, if there appear to be multiple tangents to the 
same path of motion, for example, when tracking a train consisting of many cars 
moving along its track (as shown in Fig. n more than one tangent can be used 
per image. The 9 visual planes can therefore be acquired from fewer than 9 images. 
The only limitation on the number of images is that the sequence must be at 
least 2 images long. 
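
The linear estimation step can be sketched as follows. This is only an illustrative NumPy sketch, not the authors' implementation: the function names `visual_plane` and `fit_quadric_envelope` are hypothetical, and it assumes each observation has already been converted to a visual plane $u = M_i^\top l_i$. Each plane contributes one linear equation $u^\top Q^* u = 0$ in the 10 independent entries of the symmetric matrix $Q^*$, and the null vector of the stacked system is taken from an SVD.

```python
import numpy as np

IU, JU = np.triu_indices(4)  # index pairs of the 10 independent entries of a symmetric 4x4

def visual_plane(M, l):
    """Back-project the image line l (3-vector) through the camera M (3x4): plane M^T l."""
    return M.T @ l

def fit_quadric_envelope(planes):
    """Symmetric 4x4 Q* (up to scale) with u^T Q* u = 0 for every plane u, in least squares."""
    planes = np.asarray(planes, dtype=float)
    # Planes are homogeneous, so rescale each one to unit norm for better conditioning.
    planes = planes / np.linalg.norm(planes, axis=1, keepdims=True)
    # Off-diagonal entries of Q* appear twice in u^T Q* u, hence the factor 2.
    weights = np.where(IU == JU, 1.0, 2.0)
    A = planes[:, IU] * planes[:, JU] * weights      # one row per observation
    q = np.linalg.svd(A)[2][-1]                      # singular vector of the smallest singular value
    Q = np.zeros((4, 4))
    Q[IU, JU] = q
    Q[JU, IU] = q
    return Q
```

With exactly 9 planes the system has a one-dimensional null space in the generic case; with more planes the same call returns the least-squares estimate. The rank-3 condition on $Q^*$ is not enforced in this sketch.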

Given that we have found the disk quadric $Q^*$, the reconstructed 3D points 
are simply $P_i = Q^* M_i^\top l_i$. The plane $\pi$ on which the conic resides is the null 
space of $Q^*$, i.e., $Q^*\pi = 0$. To recover the point-equation of the conic path we 
do the following. 

We have seen previously that with rank-deficient quadrics the cofactors are 
not sufficient for moving between plane-equation and point-equation form. We 
describe the conic as the intersection of a proper cone and the plane $\pi$. The apex 
of the proper cone can be chosen arbitrarily, thus let $X$ be the chosen apex of 
the cone we wish to construct. $U^\top Q^* U = 0$ describes the plane-equation of the 
disk quadric and includes all tangent planes to the conic we wish to recover, thus 
by adding the constraint $X^\top U = 0$ we obtain a subset of the tangent planes 
that define the plane-equation of the enveloping cone whose apex is at $X$. For 
every such tangent plane $U$, $P = Q^* U$ is a point on the conic section. We wish 
to express $U$ as a function (uniquely) of point coordinates $P$, as follows. We have 
two sources of equations: $Q^* U = P$ and $X^\top U = 0$, i.e., 

$$ \begin{pmatrix} Q^* \\ X^\top \end{pmatrix} U = \begin{pmatrix} P \\ 0 \end{pmatrix} $$

and from the pseudo-inverse we obtain: 

$$ U = (Q^* Q^* + X X^\top)^{-1} Q^* P = Q^{\#} P. $$

Note that the matrix in parentheses is of full rank because of the addition of 
$X X^\top$. Substituting $U$ in the plane-equation $U^\top Q^* U = 0$ we obtain the point- 
equation: 

$$ P^\top Q^{\#\top} Q^* Q^{\#} P = 0 $$

i.e., the rank-3 quadric locus (quadric cone) is $Q = Q^{\#\top} Q^* Q^{\#}$. A numerical 
example of the transformation between plane and point equations for degenerate 
quadrics is helpful. 

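Suppose we are given the plane equation of a disk quadric: 

$$ U_1^2 + 4U_1U_2 + 6U_1U_3 - 6U_1U_4 + U_2^2 + U_3^2 - 2U_3U_4 + U_4^2 = 0 \tag{1} $$

Or in matrix representation: 

$$ Q^* = \begin{pmatrix} 1 & 2 & 3 & -3 \\ 2 & 1 & 0 & 0 \\ 3 & 0 & 1 & -1 \\ -3 & 0 & -1 & 1 \end{pmatrix} \tag{2} $$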



The plane of the conic is given by the null space of $Q^*$ and is therefore the 
plane $[0, 0, 1, 1]$, or $Z + 1 = 0$. We now define the cone $K^*$ of enveloping planes 
tangent to $Q^*$ that coincide with the point $X = [0, 0, 0, 1]^\top$, by setting $U_4 = 0$ 
in Eq. 1. We get the plane equation of $K^*$: 

$$ U_1^2 + 4U_1U_2 + 6U_1U_3 + U_2^2 + U_3^2 = 0 \tag{3} $$

Or in matrix representation: 

$$ K^* = \begin{pmatrix} 1 & 2 & 3 & 0 \\ 2 & 1 & 0 & 0 \\ 3 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} \tag{4} $$

Next, we translate the plane equation of the cone into its point equation, by 
expressing it in the coordinates of the tangency point to a general plane $U \in K^*$: 

$$ K^* U = \begin{pmatrix} U_1 + 2U_2 + 3U_3 \\ 2U_1 + U_2 \\ 3U_1 + U_3 \\ 0 \end{pmatrix} = \begin{pmatrix} X \\ Y \\ Z \\ W \end{pmatrix} \tag{5} $$

We solve Eq. 5 for $U_1, U_2, U_3$, thus expressing them as linear combinations of 
$X, Y, Z$: 

$$ \begin{aligned} U_1 &= -\tfrac{1}{12}X + \tfrac{1}{6}Y + \tfrac{1}{4}Z \\ U_2 &= \tfrac{2}{3}Y + \tfrac{1}{6}X - \tfrac{1}{2}Z \\ U_3 &= \tfrac{1}{4}Z + \tfrac{1}{4}X - \tfrac{1}{2}Y \end{aligned} \tag{6} $$

Substituting Eq. 6 for $U_1, U_2, U_3$ in the plane equation gives us the point equation 
of the cone $K$, in $X, Y, Z$: 

$$ X^2 - 4XY - 6XZ - 8Y^2 + 12YZ - 3Z^2 = 0 \tag{7} $$

We now have the point equations of two surfaces, defining the conic in terms of 
points in space. 

The intersection of these two surfaces is the conic on the plane, the collection 
of points which belong both to the cone and to the plane of the disk quadric. 
The equation of the plane gives us $Z = -1$, which we substitute in Eq. 7 and 
we get the equation of the conic on the plane: 

$$ X^2 - 4XY + 6X - 8Y^2 - 12Y - 3 = 0 \tag{8} $$
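
The numerical example can be cross-checked with a few lines of NumPy. The sketch below is an illustration rather than part of the original method description; the names `Q_sharp`, `Q_point`, `conic_eq8` and `point_form` are hypothetical. It builds the point-equation matrix from the pseudo-inverse formula with the apex $X = [0,0,0,1]^\top$ and evaluates it on points of the plane $Z = -1$, where it should agree with the left-hand side of Eq. 8 up to a constant factor. The comparison is made only on the plane of the conic, which is where the conic is defined.

```python
import numpy as np

Q_star = np.array([[ 1, 2,  3, -3],
                   [ 2, 1,  0,  0],
                   [ 3, 0,  1, -1],
                   [-3, 0, -1,  1]], dtype=float)
apex = np.array([0.0, 0.0, 0.0, 1.0])      # the point X chosen in the example

# Plane of the conic: null vector of Q*, scaled so that its last entry is 1.
plane = np.linalg.svd(Q_star)[2][-1]
plane = plane / plane[3]

# U = Q_sharp P with Q_sharp = (Q* Q* + X X^T)^{-1} Q*, then the point-equation matrix.
Q_sharp = np.linalg.solve(Q_star @ Q_star + np.outer(apex, apex), Q_star)
Q_point = Q_sharp.T @ Q_star @ Q_sharp

def conic_eq8(x, y):
    """Left-hand side of Eq. 8."""
    return x**2 - 4*x*y + 6*x - 8*y**2 - 12*y - 3

def point_form(x, y):
    """P^T Q P for the point P = (x, y, -1, 1), which lies on the plane Z = -1."""
    P = np.array([x, y, -1.0, 1.0])
    return P @ Q_point @ P
```

For any `(x, y)`, `point_form(x, y)` equals `-conic_eq8(x, y) / 12`, so both expressions vanish on the same conic.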



3.1 Recovering Plücker Coordinates of a Line 

Given a quadric envelope $Q^*$ representing a line in 3D, we would like to recover 
the Plücker coordinates of the line. As we learn from Table 1, $Q^*$ is of rank 2 
and represents a line by encoding the information of a pair of 3D points that 
coincide with the line. The null space of $Q^*$ is of dimension 2 and consists of two 
planes: 

$$ \mathrm{null}(Q^*) = [\pi_1, \pi_2] $$

These planes satisfy $\pi_1^\top Q^* \pi_1 = 0$, $\pi_2^\top Q^* \pi_2 = 0$, thus they are tangent to $Q^*$. 
Since $Q^*$ is a line, it must lie on both planes $\pi_1$ and $\pi_2$; it is the line of intersection 
of $\pi_1$ and $\pi_2$. We can find two points $P, Q$ on the line of intersection and perform 
the join operation on them to get the Plücker coordinates of the line. 

$$ [P, Q] = \mathrm{null}\!\left( [\pi_1, \pi_2]^\top \right) \tag{9} $$

$$ L = P \wedge Q = [\, X_p - X_q,\; Y_p - Y_q,\; Z_p - Z_q,\; X_pY_q - Y_pX_q,\; X_pZ_q - Z_pX_q,\; Y_pZ_q - Z_pY_q \,] \tag{10} $$
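
A minimal NumPy sketch of this recovery is given below. The helper names (`null_space`, `join`, `line_from_envelope`) are hypothetical, and the join is written for general homogeneous points so that no division by $W$ is needed; it reduces to Eq. 10 when both points are scaled to $W = 1$.

```python
import numpy as np

def null_space(A, dim):
    """Last `dim` right singular vectors of A, returned as rows (an orthonormal basis)."""
    return np.linalg.svd(np.atleast_2d(A))[2][-dim:]

def join(P, Q):
    """Plücker coordinates of the line through the homogeneous points P and Q."""
    p, wp = P[:3], P[3]
    q, wq = Q[:3], Q[3]
    d = wq * p - wp * q                        # (Xp - Xq, Yp - Yq, Zp - Zq) when W = 1
    m = np.array([p[0]*q[1] - p[1]*q[0],       # XpYq - YpXq
                  p[0]*q[2] - p[2]*q[0],       # XpZq - ZpXq
                  p[1]*q[2] - p[2]*q[1]])      # YpZq - ZpYq
    return np.concatenate([d, m])

def line_from_envelope(Q_star):
    """Line encoded by a rank-2 quadric envelope, following Eqs. 9 and 10."""
    pi1, pi2 = null_space(Q_star, 2)               # two planes containing the line
    P, Q = null_space(np.vstack([pi1, pi2]), 2)    # two points lying on both planes
    return join(P, Q)
```

The result is defined only up to scale, since a different choice of the two points on the line multiplies all six coordinates by a common factor.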
4 Experiments 

We have conducted a number of experiments on both synthetic and real image 
sequences. We report here a number of examples of real image sequences, varying 
in the amount and type of measurements used for recovering the motion path 
γ of a moving object. For each experiment we will present the input sequence 
(in whole or partially) showing the tangent-of-sight lines used as input and the 
resulting images, which show the recovered path γ. 

In all the examples, we used sequences taken by a hand-held moving camera. 
The tangents in the images were marked manually and the projection matri- 
ces were recovered from matching points in the static background — both the 
tangents and the projection matrices were passed to the algorithm as input. 

The first example demonstrates the unifying quality of the algorithm, hand- 
ling a scene containing both a conic trajectory and a line trajectory. We took a 
sequence of 9 images viewing a toy train moving along its circular track and a 
toy jeep moving along a straight line. Both for the trajectory of the train and 
for the trajectory of the car, one tangent was used in each of the 9 images. The 
tangents to the trajectory of the train were taken from the bottom of the engine 
car. The tangents to the trajectory of the jeep were taken where the wheels touch 
the chess-board. Fig. 3 shows the 9 input images; in each image the tangent to 
the train is drawn in black and the tangent to the jeep is drawn in white. 

Fig. 4 shows the result of this experiment. The images shown are a subset of 
the image sequence. In each image the recovered conic trajectory of the train, 
projected to the image, is drawn in black and the recovered line trajectory of 
the jeep is drawn in white. 

The second example was designed for quantitative estimation of accuracy. 
The example was created using 9 images viewing a spinning turntable, with one 
tangent to the turntable taken in each image. In the background we placed a 
static object, used for Euclidean calibration, consisting of three planes orthogonal 
to each other featuring a regular chess-board pattern. The plane on the floor was 
taken to be Y = 100 and the two other planes were taken to be X = 0 and Z = 0. 
The projection matrices were created from points on the calibration object. The 
size of each square on the grid is 2.5 x 2.5 cm, and was taken to be 10 x 10 in our 
coordinate system. The height of the turntable is 3 cm, i.e. 12 units in this 
coordinate system, so the plane of the turntable is the plane Y = 112. The plane 
of the conic recovered by the algorithm was: 

0.0026X + Y + 0.0292Z = 112.1446 

which is very close to the known plane. Fig. 5 shows 4 of the input images used 
in this example with the tangent taken in each image, the projection of the conic 
onto each image and a regular grid on the plane of the conic. 

Fig. 3. Input sequence for 1st example, 9 images containing one tangent to each 
trajectory per image. Tangents to conic trajectory are drawn in black, tangents to line 
trajectory are drawn in white. 

5 Summary 

We have introduced the problem of recovering the path of a moving object seen 
from a moving camera using tangent-of-sight measurements and have shown that 
the degenerate quadric envelope is a natural tool for the problem of reconstructing 
second-order paths in space. 

Our technique is very simple as each observation provides a linear estimate 
to a disk quadric that contains all the necessary information for recovering the 
8 parameters of the conic-shaped path. A byproduct of our approach is that the 
system does not break down when the path is a straight line, as it is a special 
case of a degenerate quadric envelope. 

References 

1. S. Avidan and A. Shashua. Trajectory triangulation of lines: Reconstruction of a 
3d point moving along a line from a monocular image sequence. In Proceedings of 
the IEEE Conference on Computer Vision and Pattern Recognition, June 1999. 

2. R.A. Manning and C.R. Dyer. Interpolating view and scene motion by dynamic 
view morphing. In Proceedings of the IEEE Conference on Computer Vision and 
Pattern Recognition, pages 388-394, Fort Collins, Co., June 1999. 

3. A. Shashua, S. Avidan, and M. Werman. Trajectory triangulation over conic sec- 
tions. In Proceedings of the International Conference on Computer Vision, pages 
330-336, Corfu, Greece, September 1999. 




630 D. Segal and A. Shashua 




Fig. 4. Recovered planar conic (in black) and line (in white) are projected to several 
reference images from the input sequence 
Fig. 5. 4 out of 9 images from input sequence used for numerical example. Images show 
projection of the conic to each image drawn in black, tangent in each image drawn in 
white and a regular grid of bright dots on the plane of the conic. 
Abstract. The paper has two main contributions. The first is a set 
of methods for computing structure and motion for m ≥ 3 views of 6 
points. It is shown that a geometric image error can be minimized over 
all views by a simple three parameter numerical optimization. Then, that 
an algebraic image error can be minimized over all views by computing 
the solution to a cubic in one variable. Finally, a minor point is that this 
“quasi-linear” solution enables a more concise algorithm than any 
given previously for the reconstruction of 6 points in 3 views. 

The second contribution is an m view n ≥ 6 point robust reconstruction 
algorithm which uses the 6 point method as a search engine. This extends 
the successful RANSAC based algorithms for 2-views and 3-views to m 
views. The algorithm can cope with missing data and mismatched data 
and may be used as an efficient initializer for bundle adjustment. 

The new algorithms are evaluated on synthetic and real image sequences, 
and compared to optimal estimation results (bundle adjustment). 



1 Introduction 

A large number of methods exist for obtaining 3D structure and motion from 
features tracked through image sequences. Their characteristics vary from the 
so-called minimal methods which work with the least data necessary to 

compute structure and motion, through intermediate methods FTm which may 
perform mismatch (outlier) rejection as well, to the full-bore bundle adjustment. 

The minimal solutions are used as search engines in robust estimation algo- 
rithms which automatically compute correspondences and tensors over multiple 
views. For example, the two-view seven-point solution is used in the RANSAC 
estimation of the fundamental matrix in [22], and the three-view six-point 
solution in the RANSAC estimation of the trifocal tensor in [21]. It would seem 
natural then to use a minimal solution as a search engine in four or more views. 
The problem is that in four or more views a solution is forced to include a 
minimization to account for measurement error (noise). This is because in the 
two-view seven-point and three-view six-point cases there are the same num- 
ber of measurement constraints as degrees of freedom in the tensor; and in both 
cases one or three real solutions result (and the duality explanation for this equi- 
valence was given by P|). However, the four- views six-points case provides one 
more constraint than the number of degrees of freedom of the four-view geome- 
try (the quadrifocal tensor). This means that unlike in the two- and three- view 
cases where a tensor can be computed which exactly relates the measured points 
(and also satisfies its internal constraints), this is not possible in the four (or 
more) view case. Instead it is necessary to minimize an image measurement error 
whether algebraic or geometric. 

In this paper we develop a novel quasi-linear solution for the 6 point case in 
three or more views. The solution minimizes an algebraic image error, and its 
computation involves only an SVD and the solution of a cubic equation in a single 
variable. This is described in section 3. We also describe a sub-optimal method 
(compared to bundle adjustment) which minimizes geometric image error at the 
cost of only a three parameter optimization. Before describing the new solutions, 
we first demonstrate the poor estimate which results if the error that is minimized 
is not in the measured image coordinates, but instead in a projectively 
transformed image coordinate frame. This is described in section 2. 

A second part of the paper describes an algorithm for computing a reconstruc- 
tion of cameras and 3D scene points from a sequence of images. The objectives 
of such algorithms are now well established: 



1. Minimize reprojection error. A common statistical noise model assu- 
mes that measurement error is isotropic and Gaussian in the image. The 
Maximum Likelihood Estimate in this case involves minimizing the total 
squared reprojection error over the cameras and 3D points. This is bundle 
adjustment. 

2. Cope with missing data. Structure-from-motion data often arises from 
tracking features through image sequences and any one track may persist 
only in few of the total frames. 

3. Cope with mismatches. Appearance-based tracking can produce tracks of 
non-features. A common example is a T-junction which generates a strong 
corner, moving slowly between frames, but which is not the image of any 
one point in the world. 



Bundle adjustment |S| is the most accurate and theoretically best justified 
technique. It can cope with missing data and, with a suitable robust statisti- 
cal cost function, can cope with mismatches. It will almost always be the final 
step of a reconstruction algorithm. However, it is expensive to carry out and, 
more significantly, requires a good initial estimate in order to be effective (fewer 
iterations, and less likely to converge to local minimum). Current methods of 
initializating a bundle adjustment include factorization j 1 1 )ll 21 1 iSI23j . hierarchi- 
cal combination of sub-sequences jS|, and the Variable State Dimension Filter 

(vsDF) in). 

In the special case of affine cameras, factorization methods m minimize 
reprojection error ED and so give the optimal solution found by bundle ad- 
justment. However, factorization cannot cope with mismatches, and methods to 
overcome missing data HH lose the optimality of the solution. In the general case 
of perspective projection iterative factorization methods have been successfully 
developed and have recently proved to produce excellent results. The problems 
of missing data and mismatches remain though. 
In this paper we describe a novel algorithm for computing a reconstruction 
satisfying the three basic objectives above (optimal, missing data, mismatches). 
It is based on using the six-point algorithm as a robust search engine, and is 
described in section 0] 



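Notation. The standard basis will refer to the five points in $\mathbb{P}^3$ whose 
homogeneous coordinates are: 

$$ E_1 = (1,0,0,0)^\top,\quad E_2 = (0,1,0,0)^\top,\quad E_3 = (0,0,1,0)^\top,\quad E_4 = (0,0,0,1)^\top,\quad E_5 = (1,1,1,1)^\top. $$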

For a 3-vector $v = (x, y, z)^\top$, we use $[v]_\times$ to denote the 3 x 3 skew matrix such 
that $[v]_\times u = v \times u$, where $\times$ denotes the vector cross product. For three points 
in the plane, represented in homogeneous coordinates by $x, y, z$, the incidence 
relation of collinearity is the vanishing of the bracket $[x, y, z]$ which denotes the 
determinant of the 3 x 3 matrix whose columns are $x, y, z$. It equals $x \cdot (y \times z)$ 
where $\cdot$ is the vector dot product. 



2 Linear Estimation Using a Duality Solution 

This section briefly outlines a method proposed by Hartley 0 for computing a 
reconstruction for six points in three or more views. The method is based on 
the Carlsson and Weinshall m duality between points and cameras. From this 
duality it follows that an algorithm to compute the fundamental matrix (seven 
or more points in two views) may be applied to six points in three or more 
views. This has the advantage that it is a linear method however, as we shall 
demonstrate, the error distribution that is minimized is transformed in a highly 
non linear way, leading to a biased estimate. Thus this algorithm is only included 
here as a warning against minimizing errors in a projectively transformed image 
frame - we are not recommending it. 

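The duality proceeds as follows. A projective basis is chosen in each image 
such that the first four points are 

$$ e_1 = (1,0,0)^\top,\quad e_2 = (0,1,0)^\top,\quad e_3 = (0,0,1)^\top,\quad e_4 = (1,1,1)^\top. $$

Assuming in addition that the corresponding 3D points are $E_1, \ldots, E_4$, the 
camera matrix may be seen to be of the form 

$$ P = \begin{pmatrix} a_i & 0 & 0 & -d_i \\ 0 & b_i & 0 & -d_i \\ 0 & 0 & c_i & -d_i \end{pmatrix} \tag{1} $$

Such a camera matrix is called a reduced camera matrix. Now, if $X = (X, Y, Z, T)^\top$ 
is a 3D point, then it can be verified that 

$$ \begin{pmatrix} a & 0 & 0 & -d \\ 0 & b & 0 & -d \\ 0 & 0 & c & -d \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ T \end{pmatrix} = \begin{pmatrix} X & 0 & 0 & -T \\ 0 & Y & 0 & -T \\ 0 & 0 & Z & -T \end{pmatrix} \begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix} \tag{2} $$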



Note that the roles of point and camera are swapped in this last equation. 
This observation allows us to apply the algorithm for projective reconstruction 
from two views of many points to solve for six points in many views. The general 
idea is as follows. 



1. Apply a transformation to each image so that the first four points are mapped 
to the points of a canonical image basis. 

2. The two other points in each view are also transformed by these mappings 
- a total of two points in each image. Swap the roles of points and views to 
consider this as a set of two views of several points. 

3. Use a projective reconstruction algorithm (based on the fundamental matrix) 
to solve the two-view reconstruction problem. 

4. Swap back the points and camera coordinates as in (2). 

5. Transform back to the original image coordinate frame. 
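
The swap of roles between point and camera in step 2 and step 4 rests on the identity (2), which can be checked numerically. The following is a small sketch only, assuming the sign convention in which the last column of the reduced camera matrix is $-d$; the function names `reduced_camera` and `dual_camera` are hypothetical.

```python
import numpy as np

def reduced_camera(a, b, c, d):
    """Reduced camera matrix with parameters (a, b, c, d)."""
    return np.array([[a, 0, 0, -d],
                     [0, b, 0, -d],
                     [0, 0, c, -d]], dtype=float)

def dual_camera(X):
    """The matrix built from a point (X, Y, Z, T) that plays the camera role in (2)."""
    x, y, z, t = X
    return reduced_camera(x, y, z, t)

camera_params = np.array([2.0, -1.0, 0.5, 3.0])    # (a, b, c, d), arbitrary values
point = np.array([1.0, 4.0, -2.0, 0.5])            # (X, Y, Z, T), arbitrary values

image_of_point = reduced_camera(*camera_params) @ point
image_of_camera = dual_camera(point) @ camera_params
```

Both products give the same image point, which is what allows a two-view algorithm for many points to be reused for six points in many views.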

The main difficulty with this algorithm is the distortion of the image measu- 
rement error distributions by the projective image mapping. One may work very 
hard to find a solution with minimal residual error with respect to the transfor- 
med image coordinates only to find that these errors become very large when 
the image points are transformed back to the original coordinate system. A cir- 
cular Gaussian distribution is transformed by a projective transformation to a 
distribution that is no longer circular, and not even Gaussian. This is illustrated 
in figure 1. Common methods of two-view reconstruction are not able to handle 
such error distributions effectively. The method used for reconstruction from 
the transformed data was a dualization of one of the best methods available for 
two- view reconstruction - an iterative method that minimizes algebraic error |0| . 

3 Reconstruction from Six Points over m Views 

This section describes the main algebraic development of the six point method. In 
essence it is quite similar to the development given by Hartley |Z] and Quan uni 
for a reconstruction of 6 points from 3 views. The difference is that Quan used 
a standard projective basis for both the image and world points, whereas here 
the image coordinates are not transformed. As described in section 2, the use 
of a standard basis in the image severely distorts the error that is minimized. 
The numerical results that follow demonstrate that the method described here 
produces a near optimal solution. 

In the following it will be assumed that we have six image points $x_i$ in 
correspondence over m views. The idea then is to compute cameras for each 
view such that the scene points $X_i$ project exactly to their image $x_i$ for the first 
five points. Any error minimization required is then restricted to the sixth point 
$X_6$, in the first instance, leading to a three parameter optimization problem. 
Fig. 1. Left: Residual error as a function of image noise for six points over 20 views. 
The upper curve is the result of a duality-based reconstruction algorithm, the lower 
is the result of bundle adjustment. The method for generating this synthetic data 
is described in section As may be seen the residual error of the duality-based 
algorithm is extremely high, even for quite low noise levels. It is evident that this 
method is unusable. In fact the results prove to be unsatisfactory for initializing a 
bundle adjustment in the original coordinate system. Right: Minimizing geometric 
error (as algebraic error minimization tries to approximate this) in a very projectively 
transformed space pulls back to a point away from the ellipse centre in the original 
image. 



3.1 A Pencil of Cameras 

Each correspondence between a scene point $X$ and its image $x$ under a perspective 
camera $P$ gives three linear equations for $P$ whose combined rank is 2. These 
linear equations are obtained from 

$$ x \times P X = 0 \tag{3} $$

Given only five scene points, assumed to be in general position, it is possible 
to recover the camera up to a 1-parameter ambiguity. More precisely, the five 
points generate a linear system of equations for $P$ which may be written $Mp = 0$, 
where $M$ is a 10 x 12 matrix formed from two of the linear equations (3) of each 
point correspondence, and $p$ is $P$ written as a 12-vector. This system of equations 
has a 2-dimensional null-space and thus results in a pencil of cameras. 

We are free to choose the position of the five world points (e.g. they could be 
chosen to be the points of the standard projective frame $E_1, \ldots, E_5$), thus both 
$X_i$ and $x_i$ ($i = 1, \ldots, 5$) are known and the null-space of $M$ can immediately be 
computed. The null-space will be denoted from here on by the basis of 3 x 4 
matrices $[A, B]$. Then for any choice of the scalars $(\mu : \nu) \in \mathbb{P}^1$ the camera in the 
pencil $P = \mu A + \nu B$ exactly projects the first five world points to the first five 
image points. 
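
As a rough illustration of this construction, the sketch below assembles the 10 x 12 matrix $M$ from five correspondences and reads the pencil basis $[A, B]$ off an SVD. The function name `pencil_of_cameras` and the row-by-row ordering of the 12-vector $p$ are choices made for this example only, and the two rows kept per correspondence assume that the third coordinate of each image point is non-zero.

```python
import numpy as np

def pencil_of_cameras(X, x):
    """Basis [A, B] of the cameras P with x_i ~ P X_i for five correspondences.

    X: (5, 4) homogeneous world points, x: (5, 3) homogeneous image points.
    """
    rows = []
    zero = np.zeros(4)
    for Xi, (u, v, w) in zip(X, x):
        # Two of the three rows of x × (P X) = 0, with p holding P row by row.
        rows.append(np.concatenate([zero, -w * Xi, v * Xi]))
        rows.append(np.concatenate([w * Xi, zero, -u * Xi]))
    M = np.array(rows)                        # 10 x 12
    null = np.linalg.svd(M)[2][-2:]           # basis of the 2-dimensional null space
    return null[0].reshape(3, 4), null[1].reshape(3, 4)
```

Any combination `mu * A + nu * B` of the returned matrices then reprojects the five world points onto their images, up to the usual homogeneous scale.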

Each camera P in the pencil has its optical centre located as the null- vector of 
P and thus a given pencil of cameras gives rise to a 3D curve of possible camera 
centres. In general (there are degenerate cases) the locus of possible camera 
centres will be a twisted cubic passing through the five world points. The five 
points specify 10 of the 12 degrees of freedom of the twisted cubic, the remaining 
2 degrees of freedom are specified by the 2 plane projective invariants of the five 
image points. If a sixth point in 3-space lies on the twisted cubic then there is a 
one parameter family of cameras which will exactly project all six space points 
to their images. This situation can be detected (in principle) because if the space 
point lies on the twisted cubic then all 6 image points lie on a conic. 



3.2 The Quadric Constraints 

We continue to consider a single camera $P$ mapping a set of points $X_1, \ldots, X_6$ to 
image points $x_1, \ldots, x_6$. Let $[A, B]$ be the pencil of cameras consistent with the 
projections of the first five points. Since $P$ lies in the pencil, there are scalars 
$(\mu : \nu) \in \mathbb{P}^1$ such that $P = \mu A + \nu B$ and so the projection of the sixth world 
point $X_6$ is $x_6 = \mu A X_6 + \nu B X_6$. This means that the three points $x_6, AX_6, BX_6$ 
are collinear in the image, so 

$$ [x_6, AX_6, BX_6] = 0, \tag{4} $$

which is a quadratic constraint on $X_6$. The 3 x 3 determinant of (4) can be 
expressed as $X_6^\top (A^\top [x_6]_\times B) X_6 = 0$. As the skew-symmetric part of the matrix 
$(A^\top [x_6]_\times B)$ does not contribute to this equation, (4) is equivalent to the 
constraint that $X_6$ lies on a quadric $Q$ specified by the symmetric part of $(A^\top [x_6]_\times B)$. 
Also, by construction, each of the first five points $X_i$ ($i = 1, \ldots, 5$) lies on $Q$ since 
$X_i^\top Q X_i = X_i^\top A^\top [x_6]_\times B X_i = x_i^\top [x_6]_\times x_i = 0$. To summarize so far: 

Let $[A, B]$ be the pencil of cameras consistent with the projections of five known 
points $X_i$ to image points $x_i$. Let $x_6$ be a sixth image point. Then the 3D point 
$X_6$ mapping to $x_6$ must lie on a quadric $Q$ given by 

$$ Q = (A^\top [x_6]_\times B)_{\mathrm{sym}} = A^\top [x_6]_\times B - B^\top [x_6]_\times A. \tag{5} $$

In addition, the known points $X_1, \ldots, X_5$ also lie on $Q$. 
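
The quadric of equation (5) is straightforward to form once a pencil basis is available. The sketch below is illustrative only (the names `skew` and `quadric_from_pencil` are hypothetical); it takes the two 3 x 4 matrices of the pencil and the sixth image point and returns the symmetric matrix $A^\top [x_6]_\times B - B^\top [x_6]_\times A$.

```python
import numpy as np

def skew(v):
    """3x3 matrix [v]_x such that skew(v) @ u equals np.cross(v, u)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def quadric_from_pencil(A, B, x6):
    """Quadric of Eq. (5): A^T [x6]_x B - B^T [x6]_x A, a symmetric 4x4 matrix."""
    M = A.T @ skew(x6) @ B
    # [x6]_x is skew-symmetric, so M.T equals -B^T [x6]_x A.
    return M + M.T
```

Evaluating `X @ Q @ X` for the five known points and for the true sixth point should give zero up to rounding, which is a convenient sanity check on a computed pencil.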

In the particular case where the five points $X_i$ are the points $E_i$ of a 
projective basis the conditions $X_i^\top Q X_i = 0$ allow the form of $Q$ (or indeed of 
any quadric $Q$ which passes through each $E_i$) to be specified in more detail: from 
$E_i^\top Q E_i = 0$ for $i = 1, \ldots, 4$, we deduce that the four diagonal elements of $Q$ 
vanish. From $E_5^\top Q E_5 = 0$ it follows that the sum of elements of $Q$ is zero. Thus, we 
may write $Q$ in the following form 

$$ Q = \begin{pmatrix} 0 & w_1 & w_2 & -\Sigma \\ w_1 & 0 & w_3 & w_4 \\ w_2 & w_3 & 0 & w_5 \\ -\Sigma & w_4 & w_5 & 0 \end{pmatrix} \tag{6} $$

where $\Sigma = w_1 + w_2 + w_3 + w_4 + w_5$. The conclusion we draw from this is that 
if $X_6 = (p, q, r, s)^\top$ is a point lying on $Q$, the equation $X_6^\top Q X_6 = 0$ may be 
written in vector form as 

$$ (w_1, w_2, w_3, w_4, w_5) \begin{pmatrix} pq - ps \\ pr - ps \\ qr - ps \\ qs - ps \\ rs - ps \end{pmatrix} = 0 \tag{7} $$

or more briefly, $w^\top \hat{X} = 0$, where $\hat{X}$ is the column vector in (7). 
Note, equations are algebraically equivalent to the equations obtained 

by Quan m and Carlsson-Weinshall However, they differ in that here the 
original image coordinate system is used, with the consequence that a different 
numerical solution is obtained in the over constrained case. It will be seen in 
section roi that this algebraic solution may be a close approximation of the 
solution which minimizes geometric error. 

Solving for the Point X. Now consider m views of 6 points and suppose 
again that the first five world points are in the known positions $E_1, \ldots, E_5$. To 
compute projective structure it suffices to find the sixth world point $X_6$. In the 
manner described above, each view provides a quadric on which $X_6$ must lie. 
For two views the two associated quadrics intersect in a curve, and consequently 
there is a one parameter family of solutions for $X_6$ in that case. The curve will 
meet a third quadric in a finite number of points, so 3 views will determine a 
finite number (namely 2 x 2 x 2 = 8 by Bézout’s theorem) of solutions for $X_6$. 
However, five of these points are the points $E_1, \ldots, E_5$ which must lie on all three 
quadrics. Thus there are up to three possible solutions for $X_6$. With more than 
three views, a single solution will exist, except for critical configurations. 

The general strategy for finding $X_6$ is as follows: For each view $j$, the quadratic 
constraint $X_6^\top Q^j X_6 = 0$ on $X_6$ can be written as the linear constraint 
$(w^j)^\top \hat{X} = 0$ on the 5-vector $\hat{X}$ defined in terms of $X_6$ by equation (7). The vector 
$w^j$ is obtained from the coefficients of the quadric $Q^j$ (see below). The basic 
method is to solve for $\hat{X} \in \mathbb{P}^4$ by intersecting hyperplanes in $\mathbb{P}^4$, rather than 
to solve directly for $X_6 \in \mathbb{P}^3$ by intersecting quadrics in $\mathbb{P}^3$. 

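In more abstract terms there is a map $\psi : \mathbb{P}^3 \to \mathbb{P}^4$, given by 

$$ \psi : X = (p, q, r, s)^\top \mapsto \hat{X} = (pq - ps,\; pr - ps,\; qr - ps,\; qs - ps,\; rs - ps)^\top, $$

which is a (rational) transformation from $\mathbb{P}^3$ to $\mathbb{P}^4$, and maps any quadric 
$Q \subset \mathbb{P}^3$ through the five basepoints $E_i$ into the hyperplane defined in $\mathbb{P}^4$ by 

$$ w_1 \hat{X}_1 + w_2 \hat{X}_2 + w_3 \hat{X}_3 + w_4 \hat{X}_4 + w_5 \hat{X}_5 = 0 \tag{8} $$

where the (known) coefficients $w_i$ of $w$ are $Q_{12}, Q_{13}, Q_{23}, Q_{24}, Q_{34}$. 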



Computing $X$ from $\hat{X}$. Having solved for $\hat{X} = (a, b, c, d, e)^\top$ we wish to 
recover $X = (p, q, r, s)^\top$. By considering ratios of $a, b, c, d, e$ and their differences, 
various forms of solution can be obtained. In particular it can be shown that $X$ 
is a right nullvector of the following 6 x 4 design matrix: 

$$ \begin{pmatrix} e - d & 0 & 0 & a - b \\ e - c & 0 & a & 0 \\ d - c & b & 0 & 0 \\ 0 & e - b & a - d & 0 \\ 0 & e & 0 & a - c \\ 0 & 0 & d & b - c \end{pmatrix} \tag{9} $$

This will have nullity $\geq 1$ in the ideal noise-free case where the point $\hat{X} = 
(a, b, c, d, e)^\top$ really does lie in the range of $\psi$. When the point $\hat{X}$ does not lie 
exactly in the image of $\psi$, the matrix may have full rank, i.e. no nullvector. In the 
following we determine a solution such that the matrix always has a nullvector. 
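
The relation between the 4-vector $X = (p, q, r, s)^\top$ and the 5-vector of equation (7) can be made concrete with a short sketch. The function names below (`psi`, `design_matrix`, `point_from_five_vector`) are hypothetical, and the null vector is simply taken as the right singular vector of smallest singular value, which is one reasonable choice among several.

```python
import numpy as np

def psi(X):
    """Map X = (p, q, r, s) to the 5-vector (a, b, c, d, e) of Eq. (7)."""
    p, q, r, s = X
    return np.array([p*q - p*s, p*r - p*s, q*r - p*s, q*s - p*s, r*s - p*s])

def design_matrix(X_hat):
    """The 6x4 matrix of Eq. (9), built from X_hat = (a, b, c, d, e)."""
    a, b, c, d, e = X_hat
    return np.array([[e - d, 0, 0, a - b],
                     [e - c, 0, a, 0],
                     [d - c, b, 0, 0],
                     [0, e - b, a - d, 0],
                     [0, e, 0, a - c],
                     [0, 0, d, b - c]], dtype=float)

def point_from_five_vector(X_hat):
    """Unit-norm right null vector of the design matrix, i.e. X up to scale and sign."""
    return np.linalg.svd(design_matrix(X_hat))[2][-1]
```

For example, `psi([1, 2, 3, 4])` gives `(-2, -1, 2, 4, 8)`, and the design matrix built from it annihilates `(1, 2, 3, 4)`.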
A Cubic Constraint. The fact that $\dim \mathbb{P}^3 = 3 < 4 = \dim \mathbb{P}^4$ implies that 
the image of $\psi$ is not all of $\mathbb{P}^4$. In fact the image is the hypersurface $S$ cut out 
by the cubic equation 

$$ S(a, b, c, d, e) = abd - abe + ace - ade - bcd + bde = \begin{vmatrix} e & e & b \\ d & c & b \\ d & a & a \end{vmatrix} = 0 \tag{10} $$

This can be verified by direct substitution. Alternatively it can be derived by 
observing that all 4 x 4 subdeterminants of (9) must vanish, since it is rank 
deficient. These subdeterminants will be quartic algebraic expressions in $a, b, c, d, e$, 
but are in fact all multiples of the cubic expression $S$. 

The fact that the image $\psi(X)$ of $X$ must lie on $S$ introduces the problem of 
enforcing this constraint ($S = 0$) numerically. This will be dealt with below. 



Solving for 3 Views of Six Points. The linear constraints defined by the 
three hyperplanes (8) cut out a line in $\mathbb{P}^4$. The line intersects $S$ in three points 
(generically) (see the accompanying figure). Thus there are three solutions for $X$. This is a 
well-known [15] minimal solution. Our treatment gives a simpler (than the Quan [16] 
or Carlsson and Weinshall) algorithm for computing a reconstruction from 
six points (and thereby computing a trifocal tensor for the minimum number of 
point correspondences as in [21]) because it does not require changing basis in 
the images. To be specific, the algorithm for three views proceeds as follows: 

1. From three views, obtain three equations of the form $(w^j)^\top \hat{X} = 0$ in the 
five entries of $\hat{X}$. Collecting together the $(w^j)^\top$ as the rows of a 3 x 5 matrix 
$W$, this may be written $W\hat{X} = 0$, which is a homogeneous linear system. 

2. Obtain a set of solutions of the form $\hat{X} = \alpha \hat{X}_1 + \beta \hat{X}_2$ where $\hat{X}_1$ and $\hat{X}_2$ are 
generators of the null space of the 3 x 5 linear system. 

3. By expanding out the constraint (10), form a homogeneous cubic equation 
in $\alpha$ and $\beta$. There will be either one or three real solutions. 

4. Once $\hat{X}$ is computed (satisfying the cubic constraint (10)), solve for 
$X_6 = (p, q, r, s)^\top$. This could be computed as the null-space of the matrix 
(9), or more directly, as a vector of suitably chosen 4 x 4 minors of that 
matrix. 
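
The four steps above can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the cubic in $(\alpha : \beta)$ is recovered by sampling $S$ along the line instead of expanding it symbolically, and the two affine charts $\alpha/\beta$ and $\beta/\alpha$ are both scanned so that no solution is lost when $\beta$ is close to zero (a root with ratio exactly 1 may then be reported twice).

```python
import numpy as np

def cubic_S(X_hat):
    """Cubic constraint of Eq. (10)."""
    a, b, c, d, e = X_hat
    return a*b*d - a*b*e + a*c*e - a*d*e - b*c*d + b*d*e

def design_matrix(X_hat):
    """The 6x4 matrix of Eq. (9)."""
    a, b, c, d, e = X_hat
    return np.array([[e - d, 0, 0, a - b], [e - c, 0, a, 0], [d - c, b, 0, 0],
                     [0, e - b, a - d, 0], [0, e, 0, a - c], [0, 0, d, b - c]], dtype=float)

def solve_three_views(W):
    """W: 3x5 matrix whose rows are the vectors w^j. Returns candidate points X_6."""
    X1, X2 = np.linalg.svd(W)[2][-2:]                 # step 2: null-space generators
    ts = np.array([-1.0, 0.0, 1.0, 2.0])
    solutions = []
    for U, V in ((X1, X2), (X2, X1)):                 # the two charts of (alpha : beta)
        # Step 3: S(t*U + V) is a cubic in t, so four samples determine it exactly.
        coeffs = np.polyfit(ts, [cubic_S(t * U + V) for t in ts], 3)
        for t in np.roots(coeffs):
            if abs(t.imag) < 1e-8 and abs(t.real) <= 1.0 + 1e-9:
                X_hat = t.real * U + V
                # Step 4: X_6 as the null vector of the design matrix (9).
                solutions.append(np.linalg.svd(design_matrix(X_hat))[2][-1])
    return solutions
```

Each returned vector has unit norm and is defined up to sign; with noise-free input, one of them coincides with the true sixth point.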



3.3 Four or More Views 

We extend the solution above from three to m views as follows: an equation of
the form wᵀX̃ = 0 is obtained for each view, and these may be combined
into a single equation of the form W X̃ = 0, where W is an m × 5 matrix for m
views.

Now, in the case of perfect data (no measurement error) W has rank 4 for
m ≥ 4 views, and the null-vector is the unique (linear) solution for X̃. The point
X₆ is then obtained from X̃, e.g. as the null-vector of the rank-deficient matrix
above (X̃ satisfies the cubic constraint (10)).

However, if there is measurement error (noise) then there are two problems.
First, for m = 4 views, although W has rank 4 the (linear) solution X̃ to W X̃ = 0
may not satisfy the cubic constraint (10), i.e. the linear solution may not lie on
S (and so a unique value of X₆ cannot be obtained as a null-vector as before,
because the matrix will have full rank). Second, and worse still, in the case of
m > 4 views the matrix W will generally have full rank, and there is not even an
exact linear solution for X̃.

Thus for m ≥ 4, we require another method to produce a solution which
satisfies the cubic constraint S = 0. The problem is to perform a “manifold
projection” of the least-squares solution to W X̃ = 0 onto the constraint manifold,
but in a non-Euclidean space with the usual associated problem that we don’t
know in which direction to project. We will now give a novel solution to this
algebraic problem.

Algebraic Error. An (over)determined linear system of equations is often
solved using Singular Value Decomposition, by taking as null-vector the singular
vector with the smallest singular value. The justification for this is that the
SVD elicits the “directions” of space in which the solution is well determined
(large singular values) and those in which it is poorly determined (small singular
values). Taking the singular vector with smallest singular value is the usual
“linear” solution, but as pointed out, it does not in general lie on S. However,
there may still be some information left in the second-smallest singular vector,
and taking the space spanned by the two smallest singular vectors gives a line
in P⁴, which passes through the “linear” solution and must also intersect S in
three points (S is cubic). We use these three intersections as our candidates for
X̃. Since they lie exactly on S, recovering their preimages X under ψ is not a
problem.

Geometric Error. In each image, fitting error is the distance from the
reprojected point y = PX to the measured image point x = (u, v, 1)ᵀ. The reprojected
point will depend both on the position of the sixth world point and on the choice
of camera in the pencil for that image. But for a given world point X, and choice
of camera P = μA + νB in the pencil, the residual is the 2D image vector from x
to the point y = PX = μAX + νBX on the line l joining AX and BX. The optimal
choice of μ, ν for given X is thus easy to deduce; it must be such as to make y
the perpendicular projection of x onto this line (figure 2). What this means is
that explicit minimization over camera parameters is unnecessary and so only
the 3 degrees of freedom for X remain.
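
The perpendicular-projection argument can be made concrete with a short sketch. The function name and the random test data are hypothetical; A and B stand for any basis of the camera pencil, X for a homogeneous world point, and x for the measured inhomogeneous image point (u, v).

```python
import numpy as np

def best_camera_in_pencil(A, B, X, x):
    """Closest point to x on the line through AX and BX, with its (mu, nu)."""
    p, q = A @ X, B @ X                      # homogeneous image points
    l = np.cross(p, q)                       # line joining AX and BX
    xh = np.array([x[0], x[1], 1.0])
    n2 = l[0]**2 + l[1]**2
    # Foot of the perpendicular from x onto the line l.
    y = np.asarray(x, dtype=float) - (l @ xh) / n2 * l[:2]
    yh = np.array([y[0], y[1], 1.0])
    # yh lies on the line, hence in span{p, q}: solve mu*p + nu*q = yh.
    (mu, nu), *_ = np.linalg.lstsq(np.column_stack([p, q]), yh, rcond=None)
    distance = abs(l @ xh) / np.sqrt(n2)     # point-to-line distance
    return mu, nu, y, distance

rng = np.random.default_rng(2)
A, B = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
X = rng.normal(size=4)
x = np.array([0.3, -0.2])
mu, nu, y, distance = best_camera_in_pencil(A, B, X, x)
proj = (mu * A + nu * B) @ X                 # reprojection under P = mu*A + nu*B
```

Under these assumptions the camera P = μA + νB reprojects X exactly onto the foot point y, and the residual length equals the point-to-line distance, so no search over μ, ν is needed.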

Due to the cross-product, the components lᵢ(X) of the line l(X) = AX × BX
are expressible as homogeneous quadratic functions of X, and we note that these
are expressible as linear functions of X̃ = ψ(X). This is because the quadratic
function AX × BX vanishes at each of the five basis points and so, as was noted
earlier, has the form derived for such quadrics. Thus:

$$\mathbf{l}(X) = AX \times BX = \begin{pmatrix} l_1(X) \\ l_2(X) \\ l_3(X) \end{pmatrix} = \begin{pmatrix} \cdots\; \mathbf{q}_1 \;\cdots \\ \cdots\; \mathbf{q}_2 \;\cdots \\ \cdots\; \mathbf{q}_3 \;\cdots \end{pmatrix} \tilde{X}$$

for some 3 × 5 matrix with rows qᵢ whose coefficients can be determined from
those of A and B. If the sixth image point is x = (u, v, 1)ᵀ as before, then the
squared geometric image residual becomes:

$$d(\mathbf{x}, \mathbf{l}(X))^2 = \frac{|u\,l_1(X) + v\,l_2(X) + l_3(X)|^2}{l_1(X)^2 + l_2(X)^2} = \frac{|u\,\mathbf{q}_1\tilde{X} + v\,\mathbf{q}_2\tilde{X} + \mathbf{q}_3\tilde{X}|^2}{(\mathbf{q}_1\tilde{X})^2 + (\mathbf{q}_2\tilde{X})^2} \qquad (11)$$

and this is the geometric error (summed over each image) which must be
minimized over X. We can now compare the algebraic cost to the geometric cost.
The algebraic error minimized is ‖W X̃‖, which corresponds to summing an
algebraic residual |(u q₁ + v q₂ + q₃) X̃|² over each image. Thus, the algebraic
cost neglects the denominator of the geometric cost (11).

Fig. 2. Left: The diagram shows a line in 3-space intersecting a surface of degree 3. In
the case of a line in 4-space and a hyper-surface of degree 3, the number of intersections
is also 3. Right: Minimizing reprojection in the reduced model. For a given X, the best
choice P = μA + νB of camera in the pencil corresponds to the point y = μAX + νBX on
the line closest to the measured image point x. Hence the image residual is the vector
joining x and y.



Invariance of Algebraic Error. As we have presented the algorithm so far,
there is an arbitrary choice of scale for each quadric Q_{A,B}, corresponding to the
arbitrariness in the choice of representation [A, B] of the pencil of cameras, the
scale of which depends on the scale of A, B. Which normalization is used matters,
and we address that issue now.

Firstly, by translating coordinates, we may assume that the sixth point is
at the origin. The assumption u = v = 0 on the position of the sixth image point
makes our method invariant to translations of image coordinates. It is desirable
that the normalization should be invariant to scaling and rotation as well,
since these are the transformations which preserve our error model (isotropic
Gaussian noise, see below). Our choice of normalization is most simply
described by introducing a dot product similar to the Frobenius inner product
⟨A, B⟩_Frob = trace(AᵀB) = Σ_ij A_ij B_ij. Our inner product simply leaves out the
last row:



$$\langle A, B \rangle_{\mathrm{Frob}} = \sum_{\substack{i = 1,2,3 \\ j = 1,2,3,4}} A_{ij} B_{ij}, \qquad \langle A, B \rangle_{*} = \sum_{\substack{i = 1,2 \\ j = 1,2,3,4}} A_{ij} B_{ij}$$

The normalization we use can now be described by saying that the choice
of basis of the pencil [A, B] must be an orthonormal basis with respect to the
second inner product, the one that omits the last row. To achieve this, one could
start with any basis of the pencil and use the Gram-Schmidt algorithm to
orthonormalize it. It can be shown that with this normalization the algebraic
error is invariant to scaling and rotation of the image coordinate system.
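
A minimal sketch of this normalization, assuming A and B are 3 × 4 arrays forming any basis of the pencil and that the first two rows of each are not all zero (the function names are illustrative):

```python
import numpy as np

def inner(A, B):
    """Inner product that leaves out the last row of the 3 x 4 matrices."""
    return float(np.sum(A[:2] * B[:2]))

def orthonormalize_pencil(A, B):
    """Gram-Schmidt on the pencil basis [A, B] with respect to `inner`."""
    A1 = A / np.sqrt(inner(A, A))
    B1 = B - inner(B, A1) * A1            # remove the component along A1
    B1 = B1 / np.sqrt(inner(B1, B1))
    return A1, B1

rng = np.random.default_rng(3)
A, B = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
A1, B1 = orthonormalize_pencil(A, B)
# The new basis spans the same pencil: stacking all four matrices gives rank 2.
span_rank = np.linalg.matrix_rank(
    np.stack([A.ravel(), B.ravel(), A1.ravel(), B1.ravel()]))
```

The result is a basis of the same pencil whose two members have unit length and are mutually orthogonal under the row-truncated inner product.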

3.4 Algorithm Summary

It has been demonstrated how to pass from m ≥ 3 views of six points in the
world to a projective reconstruction in a few steps. These are:

1. Compute, for each of m views, the pencil of cameras which map the five
standard basis points in the world to the first five image points, using the
recommended normalization to achieve invariance to image coordinate changes.

2. Form from each pencil [A, B] the quadric constraint on the sixth world point
X as described earlier, i.e. form wᵀX̃ = 0 in the five entries of X̃.

3. Collect together the wᵀ as the rows of an m × 5 matrix W.

4. Obtain the singular vectors corresponding to the two smallest singular values
of W via the SVD. Let these be X̃₁ and X̃₂.

5. The solution X̃ lies in the one-parameter family X̃ = αX̃₁ + βX̃₂.

6. By expanding out the constraint (10), form a homogeneous cubic equation
in α and β. There will be either one or three real solutions.

7. Once X̃ is computed (satisfying the cubic constraint (10)), solve for X₆ from
the null-space of the corresponding rank-deficient matrix.

8. (optional) Minimize reprojection error (11) over the 3 degrees of freedom in
the position of X₆.

In practice, for a given set of six points, the quality of reconstruction can vary 
depending on which point is last in the basis. We try all six in turn and choose 
the best one. 

Related Work. Yan et al. describe a linear method for reconstruction
from m > 4 views of six points. Both our method and theirs turn the set of
m quadratic equations in X into a set of m linear equations in some auxiliary
variables (X̃ here), and then impose constraints on a resulting null-space. There
are two problems with their method when measurement error is present: first,
their solution may not satisfy (both) the constraints on the auxiliary variable
and second, their method uses projectively transformed image coordinates, and
so potentially suffers from the bias described earlier.

3.5 Results I 

We have computed cameras which map the first five points exactly to their 
measured image points, and then minimize either an algebraic or geometric error 
on the sixth point. As discussed in the introduction, the Maximum Likelihood
Estimate of the reconstruction, assuming isotropic Gaussian measurement noise,
is obtained by bundle adjustment in which reprojection error (squared geometric
image residual) is minimized over all points Xᵢ and all cameras. This is the
optimal reconstruction. Minimizing error on only the sixth image point is thus
a sub-optimal method.

We now give results on synthetic and real image sequences of 6 points in m 
views. The objective is to compare the performance of four algorithms: 

Quasi-linear: minimizes algebraic error on sixth point only (as in the
algorithm above).

Sub-optimal: minimizes reprojection error on the sixth point only (as in (11))
by optimizing over X₆. This is a three parameter optimization problem. It
is initialized by the quasi-linear algorithm.

Factorization: a simple implementation of projective factorization (the
projective depths are initialized as all 1s and ten iterations performed).

Bundle Adjustment: minimizes reprojection error for all six points (varying
both cameras and points Xᵢ). This is an 11m + 18 parameter optimization
problem for m views and six points. For synthetic data, it is initialized by
whichever of the above three gives the smallest reconstruction error. For real
data, it is initialized with the sub-optimal algorithm.
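
The parameter count quoted for bundle adjustment can be checked by hand, assuming each projective camera contributes 11 degrees of freedom (a 3 × 4 matrix defined up to scale) and each of the six world points contributes 3:

$$\underbrace{11\,m}_{\text{cameras}} + \underbrace{3 \times 6}_{\text{points}} = 11m + 18, \qquad \text{e.g. } m = 7:\quad 11 \cdot 7 + 18 = 95.$$

By comparison, the sub-optimal method optimizes only 3 parameters regardless of m.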

The three performance measures used are reprojection error, reconstruction
error (the registration error between the reconstruction and ground truth), and
stability (the algorithm converges). The claim is that the quasi-linear algorithm
performs as well as the more expensive variants and can safely be used in
practice.




Fig. 3. Summary of experiments on synthetic data. 1000 data sets were generated 
randomly (7 views of 6 points) and each algorithm tried on each data set. Left: For each 
of the four estimators (quasi-linear, sub-optimal, factorization and bundle adjustment), 
the graph shows the average rms reprojection error over all 1000 data sets. Middle: the 
average reconstruction error, for each estimator, into the ground truth frame. Right: 
the average number of times each estimator failed (i.e. gave a reprojection error greater 
than 10 pixels). 



Synthetic data. We first show results of testing the algorithm on synthetic
data with varying amounts of pixel localisation noise added; our noise model is
isotropic Gaussian with standard deviation σ. For each value of σ, the algorithm
is run on 1000 randomly generated data sets. Each data set is produced by
choosing six world points at random uniformly in the cube [−1, +1]³ and six
cameras with centres between 4 and 5 units from the origin and principal rays
passing through the cube. After projecting each point under each chosen camera,
artificial noise is added. The images are 512 × 512, with square pixels, and the
principal point is at the centre of the image. Figure 3 summarizes the results.

The “failures” refer to reconstructions for which some reprojection error
exceeded 10 pixels. The quality of reconstruction degrades gracefully as the noise
is turned up from the slightly optimistic 0.5 to the somewhat pessimistic 2.5;
the rms and maximum reprojection error are highly correlated, with correlation
coefficient 0.999 in each case (which may also be an indicator of graceful
degradation).



Real Data. The image sequence consists of 10 colour images (JPEG, 768 ×
1024) of a turntable, see figure 4. The algorithms from before, except
factorization, are compared on this sequence and the results tabulated also in
figure 4. Points were entered and matched by hand using a mouse (estimated
accuracy is 2 pixels standard deviation). Ground truth is obtained by measuring
the turntable with vernier calipers, and is estimated to be accurate to 0.25mm.
There were 9 tracks, all seen in all views. Of course, in principle any six tracks
could be used to compute a projective reconstruction, but in practice some bases
are much better than others. Examples of poor bases include ones which are almost
coplanar in the world or which have points very close together.

                                  basis residuals   all residuals   reconstruction
                                  (pixels)          (pixels)        error (mm)
6 points quasi-linear             0.363/2.32        0.750/2.32      0.467/0.676
6 points sub-optimal              0.358/2.33        0.744/2.33      0.424/0.596
6 points bundle adjustment        0.115/0.476       0.693/2.68      0.405/0.558
All points (and cameras) bundled  0.334/0.822       0.409/1.08      0.355/0.521



Fig. 4. Results for the 9 tracks over the 10 turntable images. The reconstruction is 
compared for the three different algorithms, residuals (reported as rms/max) are shown 
for the 6 points which formed the basis (first column) and for all reconstructed points 
taken as a whole (second column). The last row shows the corresponding residuals after 
performing a full bundle adjustment. 



Bundle adjustment achieves the smallest reprojection error over all residuals,
because it has greater freedom in distributing the error. Our method minimizes
error on the sixth point of a six point basis. Thus it is no surprise that the
effect of applying bundle adjustment to all points is to increase the error on
the basis point (column 1) but to decrease the error over all points (column
2). These figures support our claim that the quasi-linear method gives a very
good approximation to the optimized methods. Figure 5 shows the reprojected
reconstruction in a representative view of the sequence.

Fig. 5. Left: Reprojected reconstruction in view 3. The large white dots are the input
points, measured from the images alone. The smaller, dark points are the reprojected
points. Note that the reprojected points lie very close to the centre of each white dot.
The reconstruction is computed with the 6-point sub-optimal algorithm. Right: The
graph shows for each algorithm, the rms reprojection error for all 9 tracks as a function
of the number of views used. For comparison the corresponding error after full-bore
bundle adjustment is included.



4 Robust Reconstruction Algorithm 

In this section we describe a robust algorithm for reconstruction built on the
6-point engine of section 3. The input to the algorithm is a set of point tracks,
some of which will contain mismatches. Robustness means that the algorithm
is capable of rejecting mismatches, using the RANSAC paradigm. It is a
straightforward generalization of the corresponding algorithm for 7 points in 2
views and 6 points in 3 views.

Algorithm Summary. The input is a set of measured image projections. A 
number of world points have been tracked through a number of images. Some 
tracks may last for many images, some for only a few (i.e. there may be missing 
data). There may be mismatches. Repeat the following steps as required: 

1. From the set of tracks which appear in all images, select six at random. This 
set of tracks will be called a basis. 

2. Initialize a projective reconstruction using those six tracks. This will provide 
the world coordinates (of the six points whose tracks we chose) and cameras 
for all the views (either quasi-linear or with 3 degrees of freedom optimization 
on the sixth point - see below). 

3. For all remaining tracks, compute optimal world point positions using the 
computed cameras by minimizing the reprojection error over all views in 
which the point appears. This involves a numerical minimization. 

4. Reject tracks whose image reprojection errors exceed a threshold. The
number of tracks which pass this criterion is used to score the reconstruction.

The justification for this algorithm is, as always with RANSAC, that once a 
“good” basis is found it will (a) score highly and (b) provide a reconstruction 
against which other points can be tested (to reject mismatches). 
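
The sampling loop in steps 1 to 4 has the generic shape sketched below. The callables `fit_basis` and `track_error` are hypothetical placeholders for the six-point initialization and the per-track reprojection error; the demonstration at the bottom substitutes a toy line-fitting problem purely to exercise the loop and is not a six-point reconstruction.

```python
import numpy as np

def ransac_reconstruct(n_tracks, full_tracks, fit_basis, track_error,
                       threshold, n_samples, rng):
    """Generic form of the sampling loop described above.

    fit_basis(basis) -> model or None     (steps 1-2; hypothetical callable)
    track_error(model, t) -> float        (step 3; hypothetical callable)
    """
    best_model, best_inliers = None, []
    for _ in range(n_samples):
        # Step 1: pick six tracks among those seen in all images.
        basis = rng.choice(full_tracks, size=6, replace=False)
        model = fit_basis(basis)          # step 2
        if model is None:                 # rejected initialization
            continue
        basis_set = set(int(b) for b in basis)
        # Steps 3-4: score by the number of remaining tracks that pass.
        inliers = [t for t in range(n_tracks)
                   if t not in basis_set
                   and track_error(model, t) <= threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers

# Toy stand-in: "tracks" are 2D points, the "model" is a fitted line,
# and the error is the vertical offset from that line.
xs = np.arange(40, dtype=float)
ys = 2.0 * xs + 1.0
ys[30:] += 100.0                          # ten gross mismatches

def fit_line(basis):
    return np.polyfit(xs[basis], ys[basis], 1)

def line_error(model, t):
    return abs(np.polyval(model, xs[t]) - ys[t])

rng = np.random.default_rng(0)
model, inliers = ransac_reconstruct(40, np.arange(40), fit_line, line_error,
                                    threshold=1e-6, n_samples=100, rng=rng)
```

In the toy problem, a basis drawn entirely from the 30 uncontaminated points fits them exactly, so the best score is the 24 remaining clean points and every mismatch is rejected.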

4.1 Results II

The second sequence is of a dinosaur model rotating on a turntable (figure 6).
The image size is 720 × 576. Motion tracks were obtained using the fundamental
matrix based tracker described in [5]. The robust reconstruction algorithm is
applied using 100 samples to the subsequence consisting of images 0 to 5. For
these 6 views, there were 740 tracks of which only 32 were seen in all views. 127
tracks were seen in 4 or more views. The sequence contains both missing points
and mismatched tracks.




Dinosaur sequence results

                                  basis residuals   all residuals   inliers
                                  (pixels)          (pixels)
6 points quasi-linear             0.0443/0.183      0.401/1.24      95
6 points sub-optimal              0.0443/0.183      0.401/1.24      95
6 points bundle adjustment        0.0422/0.127      0.383/1.181     97
All points (and cameras) bundled  0.313/0.718       0.234/0.925     95



Fig. 6. The top row shows the images and inlying tracks used from the dinosaur
sequence. The table in the bottom row summarizes the result of comparing the three
different fitting algorithms (quasi-linear, sub-optimal, bundle adjustment). There were
6 views. For each mode of operation, the number of points marked as inliers by the
algorithm is shown in the third column. There were 127 tracks seen in four or more
views.

For the six point RANSAC basis, a quasi-linear reconstruction was rejected 
if any reprojection error exceeded 10 pixels, and the subsequent 3 degrees of 
freedom sub-optimal solution was rejected if any reprojection error exceeded a 
threshold of 5 pixels. These are very generous thresholds and are only intended 
to avoid spending computation on very bad initializations. The real criterion of 
quality is how much support an initialization has. When backprojecting tracks 
to score the reconstruction, only tracks seen in 4 or more views were used and
tracks were rejected as mismatches if any residual exceeded 1.25 pixels after
backprojection.

The algorithms of section 3.5 (except factorization) are again compared on
this sequence. The errors are summarized in figure 6. The last row shows an
additional comparison where bundle adjustment is applied to all the points and
cameras of the final reconstruction. Figure 6 also shows the tracks accepted by
the algorithm. Figure 7 shows the computed model.

Remarks entirely analogous to the ones made about the previous sequence 
apply to this one, but note specifically that optimizing makes very little difference 
to the residuals. This means that the quasi-linear algorithm performs almost as 
well as the sub-optimal one. Applying bundle adjustment to each initial 6-point 
reconstruction improves the fit somewhat, but the gain in accuracy and support 
is rather small compared to the extra computational cost (in this example, there 
was a 7-fold increase in computation time).

Fig. 7. Dinosaur sequence reconstruction: a view of the reconstructed cameras (and
points). Left: quasi-linear model, cameras computed from just 6 tracks. Middle: after
resectioning the cameras using the computed structure. Right: after bundle adjustment
of all points and cameras (the unit cube is for visualization only).



The results shown for views 0 to 5 are typical of results obtained for other
segments of 6 consecutive views from this sequence. Decreasing the number of 
views used has the disadvantage of narrowing the baseline, which generally leads 
to both structure and cameras being less well determined. The advantage of 
using only a small number of points (i.e. 6 instead of 7) is that there is a higher 
probability that sufficient tracks will exist over many views. 

5 Discussion 

Algorithms have been developed which estimate a six point reconstruction over
m views by a quasi-linear or sub-optimal method. It has been demonstrated
that these reconstructions provide cameras which are sufficient for a robust
reconstruction of n > 6 points and cameras over m views from tracks which include
mismatches and missing data. This reconstruction can now form the basis of a
hierarchical method for extended image sequences. For example, a hierarchical
method which builds a reconstruction from image triplets could now
proceed from extended sub-sequences over which at least six points are tracked.

We are currently investigating whether the efficient 3 degree of freedom
parametrization of the reconstruction can be extended to other multiple view cases,
for example seven points over m views.

Abstract. This paper considers projective reconstruction with a
hierarchical computational structure of trifocal tensors that integrates 
feature tracking and geometrical validation of the feature tracks. The 
algorithm was embedded into a system aimed at completely automatic 
Euclidean reconstruction from uncalibrated handheld amateur video 
sequences. The algorithm was tested as part of this system on a number 
of sequences grabbed directly from a low-end video camera without 
editing. The proposed approach can be considered a generalisation of a 
scheme of [Fitzgibbon and Zisserman, ECCV ‘98]. The proposed 
scheme tries to adapt itself to the motion and frame rate in the sequence 
by finding good triplets of views from which accurate and unique trifocal 
tensors can be calculated. This is in contrast to the assumption that three 
consecutive views in the video sequence are a good choice. Using 
trifocal tensors with a wider span suppresses error accumulation and 
makes the scheme less reliant on bundle adjustment. The proposed 
computational structure may also be used with fundamental matrices as 
the basic building block. 

1 Introduction 

Recovery of the shape of objects observed in several views is a branch of computer 
vision that has traditionally been called Structure from Motion (SfM). This is currently 
a very active research area [1-32]. Applications include synthesis of novel views, 
camera calibration, navigation, recognition, virtual reality, augmented reality and 
more. Recently, much interest has been devoted to approaches that do not assume any 
a priori knowledge of the camera motion nor calibration [1-4,6,9,13,20,31]. Thus, both 
the cameras and the structure are recovered. It is therefore relevant to speak of 
Structure and Motion (SaM). These approaches are very promising, especially as part 
of a potential system that extracts graphical models completely automatically from 
video sequences. 

A very brief outline of one of many possible such systems is as follows. Features 
are extracted in all views independently [33,34]. Features are then matched by 
correlation into pairs and triplets from which multiple view entities such as 
fundamental matrices [6,11,29,32,35,36] or trifocal tensors [21,23,25,26] are
calculated. Camera matrices are then instantiated in a projective frame according to
the calculated multiple view entities. The obtained pairs or triplets of camera matrices 
are transformed into a coherent projective frame [1] and optimised via bundle 
adjustment [9]. This yields a projective reconstruction that can be specialised to 
Euclidean by the use of autocalibration [10,16,20,30]. The views are now calibrated 
and a dense graphical model suitable for graphical rendering can be produced with any 
scheme developed for calibrated cameras. Examples of such schemes are space 
carving [15] and rectification [20] followed by conventional stereo algorithms. 

This paper will concentrate on the stage where multiple view entities are calculated 
and registered into a coherent projective frame. Factorisation approaches are available
that avoid the registration problems by obtaining all the camera matrices at once 
[19,27,28]. There are, however, compelling reasons for using an iterative approach 
and to build up the camera trajectory in steps. Direct estimation typically expects all 
features to be observed in all views, an assumption that in practice is heavily violated 
for long sequences. With an iterative approach, it is easier to deal with mismatches, 
another key to success in practice. Last but not least, an iterative approach makes it 
possible to use bundle adjustment in several steps and thereby build the reconstruction 
gradually with reprojection error as the criterion. 

The contribution of this paper is a hierarchical computational structure that 
integrates feature tracking and geometrical validation of the feature tracks into an 
iterative process. The algorithm was tested as part of a system that automatically goes 
from frames of a video sequence to a sparse Euclidean reconstruction consisting of 
camera views, points and lines. The rest of the paper is organised as follows. Section 2 
gives a motivation for the approach and the overall idea. The approach relies heavily 
on an algorithm to derive trifocal tensors and feature triples geometrically consistent 
with them. This estimation essentially follows [3,25], but is briefly sketched in Section 
3 due to its importance. The proposed computational structure is given in more detail 
in Section 4. Section 5 describes how the trifocal tensors of the structure are carried on 
to a projective reconstruction. Results and conclusions are given in Sections 6 and 7. 

2 Motivation and Method 

The task of matching e.g. corners or lines between widely separated views by 
correlation is notoriously difficult. Rotation, scaling, differing background, lighting or 
other factors typically distort the texture of the region around a feature. Feature
tracking makes it possible to establish matches with larger disparities. However, 
feature tracks are eventually lost and features move out of view in favour of new ones 
as the sequence goes on. Therefore, the number of feature correspondences between 
widely separated views is not always sufficient to allow a reliable estimation of a 
fundamental matrix or trifocal tensor. 

On the other hand, a certain amount of parallax is necessary for a unique 
determination of the camera matrices (up to an overall homography) and the disparity 
between views is the clue to the actual depth of the features. A wider baseline is 
preferable, since it typically increases the amount of parallax and disparity. 

To summarise the discussion, the loss of feature correspondences over time and the 
necessity of a reasonable baseline create a trade-off. This means that in practice there
is a sweet spot in terms of view separation, where calculation of the multiple view 
entities is best performed.

Many test sequences used in the vision community are well conditioned in the sense
that consecutive views are reasonably separated. A typical sequence taken by a
non-professional photographer using an ordinary handheld camera does not possess this
quality. The number of views needed to build up a reasonable separation will typically 
vary over the sequence and depends on the frame rate and the speed and type of 
camera motion. 

Too low a frame rate in relation to the motion is of course difficult to remedy. It is
therefore reasonable to use a high frame rate when acquiring the sequence and to 
develop algorithms that can adapt to the sequence and find the sweet spot in separation 
automatically. It is proposed here to achieve this by a two-step scheme. The first step 
is a preprocessor that can cope with a potentially excessive number of frames. The 
preprocessor is based on a rough global motion estimation between views and discards 
redundant views based on correlation after the motion estimation. The details are 
described in [18]. The second step, which is the topic of this paper, is a hierarchical 
computational structure of trifocal tensors. This structure integrates feature tracking 
and geometrical validation of feature tracks into an iterative process. It can be 
considered a generalisation of [1]. There, projective frames with triplets of camera 
views and corresponding features are hierarchically registered together into longer 
subsequences. The three views of one such triplet are always adjacent in the sequence, 
unless a preprocessor performs tracking. 

In the scheme proposed here, the trifocal tensor algorithm is first used on triplets of 
consecutive views. This produces trifocal tensors together with a number of feature 
triplets consistent with every tensor. The consistent feature triplets from adjacent 
tensors are connected to form longer feature tracks. The longer tracks are then used 
together with new feature triplets, provided by raw matching, as input to new trifocal 
tensor calculations. The new tensors are between views twice as far apart in the 
sequence. This is repeated in an iterative fashion to produce a tree of trifocal tensors. 
From the tree of trifocal tensors, a number of tensors are now chosen that together 
span the sequence. The choice is based on a quality measure for trifocal tensors and 
associated feature triples. The goal of the quality measure is to indicate the sweet spot 
in view separation discussed earlier. The measure should therefore favour many 
consistent features and large amounts of parallax. The resulting tensors will be 
referred to as wide tensors and can stretch over anything from three up to hundreds of 
views, depending entirely on the sequence. 
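
The track-connection step can be illustrated with a small sketch. It assumes each tensor comes with a list of consistent feature triplets given as integer feature identifiers per view, and it keeps only unambiguous joins; both the data layout and that policy are assumptions made for illustration, not details stated in the text.

```python
def connect_triplets(triples_a, triples_b):
    """Join feature triplets of two adjacent tensors into five-view tracks.

    triples_a: (f1, f2, f3) feature ids for views (i, i+1, i+2)
    triples_b: (g3, g4, g5) feature ids for views (i+2, i+3, i+4)
    Two triplets are joined when they use the same feature in the shared view.
    """
    by_shared = {}
    for t in triples_b:
        by_shared.setdefault(t[0], []).append(t)
    tracks = []
    for a in triples_a:
        matches = by_shared.get(a[2], [])
        if len(matches) == 1:             # keep unambiguous joins only
            tracks.append(a + matches[0][1:])
    return tracks

triples_a = [(0, 0, 0), (1, 2, 1), (2, 1, 3)]
triples_b = [(0, 5, 5), (1, 4, 6), (7, 7, 7)]
tracks = connect_triplets(triples_a, triples_b)
# Candidate triplets for the tensor with doubled baseline, views (i, i+2, i+4).
wide_triples = [(t[0], t[2], t[4]) for t in tracks]
```

The entries of `wide_triples` are the kind of input that, together with new raw matches, would feed the tensor computation for views twice as far apart.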

In this manner, frame instantiation and triangulation are postponed until disparity 
and parallax have been built up. The registration of intermediate views can then 
proceed to provide a denser sequence. In some cases it is desirable to include all 
views, while no intermediate views at all are required in others. The interpolative type 
of registration can sometimes be more accurate than its extrapolative counterpart. By 
taking long steps in the extrapolative registration, error accumulation is suppressed. 
This in turn makes the algorithm less reliant on the bundle adjustment process. 

The success of the algorithm relies on two properties. The first one is that the 
algorithm for the trifocal tensor performs reasonably well at determining the 
consistent triples also when the baseline is insufficient to accurately determine the 
depth of the features or to give a unique solution for the camera matrices. The second
one is that when presented with too widely separated views, the trifocal tensor
algorithm yields a result with very few matches and as a consequence, unreliable 
tensors can be detected. 

It is also worth pointing out that an approach almost identical to the one described
here may be taken with view pairs and fundamental matrices. The advantages of
using three views as the basic building block are that lines can also be used and that 
the more complete geometric constraint removes almost all the false matches. These 
advantages come at a cost of speed. 

3 Tensor Algorithm 

The way to obtain new features essentially follows [2,3] and is sketched in Figure 1. 
Harris corners [34] are matched by correlation to give pairs upon which a fundamental 
matrix estimation is performed. The estimation is done by RANSAC [37], followed by 
optimisation of a robust support function. 




Figure 1. A sketch of how feature pairs are derived 



The result is a fundamental matrix and a number of geometrically consistent corner 
pairs. Lines derived from Canny edges [33] are matched by correlation, guided by the 
extracted fundamental matrix, as described in [38]. The feature pairs are then 
connected into correspondences over three views. Given a set of feature triples, 
another RANSAC process [25] is conducted to estimate the trifocal tensor. Minimal 
sets consisting of six point triples are used to find tentative solutions for T using the 
method of [21]. The support for the tentative solutions is measured using both points 
and lines. Following [25], a robust support function based on maximum likelihood 
error of points and lines is used. The best tentative solution from the RANSAC 
process is then optimised. For best accuracy, the parameterization used in the 
optimisation should be consistent meaning that all possible choices of values for the 
parameters should yield a theoretically possible tensor. 

4 Computational Structure 

The proposed computational structure is applicable to any number of views, but as it 
simplifies this description, it is assumed that the sequence length is $N = 2^n + 1$ for 
some $n \in \mathbb{N}$. First, basic tensors $\{T(1,2,3), T(3,4,5), \ldots, T(N-2,N-1,N)\}$ are 
calculated for triplets of adjacent views. Then the next layer 
$\{T(1,3,5), T(5,7,9), \ldots, T(N-4,N-2,N)\}$ of tensors with double baseline (in 
an abstract sense) is found. 



[Figure 2 block diagram. Recoverable labels: corner pairs and line pairs for the view 
pairs (l, m) and (m, n); corner and line pairs from tensors with shorter baseline; 
connect; unguided threshold; semi-guided threshold; enforce triple geometry of corners 
and lines; consistency check; add; corner and line triples.] 

Figure 2. One iterative step of the trifocal tensor computation 



The result of the first layer is passed on to this new layer. More specifically, the 
calculation of the trifocal tensor $T(i, i+2^j, i+2^{j+1})$, where $i, j \in \mathbb{N}$, is fed with 
corner and line triples obtained from the tensors $T(i, i+2^{j-1}, i+2^j)$ and 
$T(i+2^j, i+2^j+2^{j-1}, i+2^{j+1})$. The calculation of the narrower tensors provides 
a number of consistent corner and line triples. These triples are then connected at 
frame $i+2^j$ to provide longer triples by simply dropping the nodes at frames 
$i+2^{j-1}$ and $i+2^j+2^{j-1}$. The longer triples are fed into the wider tensor 
calculation together with new triples extracted by the basic algorithm sketched above 
in Section 3. One iterative step as just described is illustrated in Figure 2. The 
recursion then proceeds until a complete tree of tensors has been built up (see Figure 
3). In this way the total number of tensor calculations becomes $2^n - 1 = N - 2$. 

Intertwining the connection of feature triples with the geometric pruning provided by 
the tensor calculation means that geometric constraints and tracking are integrated. 
Although straightforward connection of feature triples was used in the current 
implementation, more sophisticated schemes could be used, for example by taking 
smoothness of motion into account. 
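
The layered indexing described in this section can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not the implementation used in the experiments; the function names `tensor_tree` and `children` are hypothetical, and views are numbered from 1 as in the text.

```python
def tensor_tree(n):
    """Return the layers of view triples for a sequence of N = 2**n + 1 views.

    Layer j (j = 0 .. n-1) holds the triples (i, i + 2**j, i + 2**(j+1)).
    Names and layout are illustrative assumptions, not taken from the paper.
    """
    n_views = 2 ** n + 1
    layers = []
    for j in range(n):
        step = 2 ** j
        # Consecutive tensors in one layer share their end views.
        layer = [(i, i + step, i + 2 * step)
                 for i in range(1, n_views - 2 * step + 1, 2 * step)]
        layers.append(layer)
    return layers


def children(triple):
    """The two narrower tensors whose feature triples feed a wider tensor."""
    first, middle, last = triple
    half = (middle - first) // 2
    if half == 0:
        return []  # basic tensor over adjacent views has no children
    return [(first, first + half, middle), (middle, middle + half, last)]


layers = tensor_tree(3)                      # N = 9 views
total = sum(len(layer) for layer in layers)  # number of tensor calculations
```

For nine views the layers contain 4, 2 and 1 tensors, so the total of 7 agrees with the count N - 2 given above.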




Figure 3. The hierarchical structure of trifocal tensors. The frames of the sequence are 
shown as rectangles at the bottom of the figure 

For a long sequence, the number of consistent features has generally dropped below 
an acceptable level before the top of the tree with a wide tensor spanning the whole 
sequence can be reached. As mentioned above, there is a sweet spot to be found in the 
width of the tensors. To determine the best width, a quality measure is defined for the 
tensors. A tensor is ruled unacceptable and the quality measure set to zero if the 
number of consistent features is below a threshold. A rather conservative threshold of 
50 was used in the experiments. Otherwise the quality measure is defined as 

$Q = b^{\alpha} p$. (1) 

Here $b$ is an abstract baseline. It is defined as the distance between the first and 
the last frame of the tensor, but simply measured in frame numbers. The parameter $\alpha$ 
is a constant determining how greedy the algorithm is for wider tensors. A reasonable 
choice was found to be $0.5 \le \alpha \le 1$. The parameter $p$ indicates whether there is 

support in the data for a 3-dimensional geometric relationship rather than a 2- 
dimensional. With the use of this parameter, the algorithm tries to avoid degeneracies. 
The parameter p is related to the more sophisticated criterion described in [24]. If 
there is no translation between two camera views, or if all features seen by the views 
are located in a common plane, corresponding points $x$ and $x'$ in the first and second 
view are related by a homography $H$ as $x' \simeq Hx$, where $H$ is represented by a 3x3 
matrix defined up to scale. Likewise $l \simeq H^{\top} l'$ for two corresponding lines $l$ and $l'$. 
Two homographies are fitted to the feature triples consistent with the trifocal tensor 
(again using RANSAC). The parameter $p$ is then defined as the number of feature 
triples that are consistent with the trifocal tensor but inconsistent with any of the 
homographies. 

The consideration of tensors begins at the top of the tensor tree. If a tensor is ruled 
unacceptable or its children of narrower tensors both have a higher quality, the choice 
is recursively dropped to the next level of the tree. If the basic tensors at the bottom of 
the tree are also ruled unacceptable, the sequence is split into subsequences at the 
views in question. The result is now subsequences of wide tensors. A small example 
can be seen in Figure 4. 
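
The top-down choice of wide tensors can be sketched as follows. The sketch assumes that Equation (1) reads Q = b^alpha * p, and it represents each tensor as a plain dictionary; the names `quality`, `choose_wide_tensors` and the dictionary keys are hypothetical and serve only to illustrate the selection rule stated above.

```python
def quality(first, last, n_consistent, p, alpha=0.75, min_features=50):
    """Quality measure, assumed to be Q = b**alpha * p.

    A tensor with fewer than min_features consistent features is ruled
    unacceptable and gets quality zero. alpha=0.75 is an arbitrary pick
    from the range 0.5 to 1 mentioned in the text.
    """
    if n_consistent < min_features:
        return 0.0
    b = last - first  # abstract baseline, measured in frame numbers
    return (b ** alpha) * p


def choose_wide_tensors(node):
    """Top-down selection over a tensor tree.

    node is a dict with keys 'span' (first, last), 'q' (quality) and
    'children' (list of child nodes, empty for basic tensors).
    """
    kids = node["children"]
    unacceptable = node["q"] == 0.0
    kids_better = bool(kids) and all(k["q"] > node["q"] for k in kids)
    if not unacceptable and not kids_better:
        return [node["span"]]
    chosen = []
    for k in kids:  # drop to the next level of the tree
        chosen.extend(choose_wide_tensors(k))
    return chosen   # an unacceptable basic tensor contributes nothing (split)


def leaf(span, q):
    return {"span": span, "q": q, "children": []}


tree = {
    "span": (1, 9), "q": 0.0,
    "children": [
        {"span": (1, 5), "q": 30.0,
         "children": [leaf((1, 3), 10.0), leaf((3, 5), 12.0)]},
        {"span": (5, 9), "q": 0.0,
         "children": [leaf((5, 7), 8.0), leaf((7, 9), 0.0)]},
    ],
}
selected = choose_wide_tensors(tree)
```

In this made-up example the tensor over views 1 to 5 is kept, the tensor over views 5 to 9 is replaced by its acceptable child over views 5 to 7, and the sequence is split after view 7 because the last basic tensor is unacceptable.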




Figure 4. A small example of a choice of wide tensors 



5 Building the Projective Reconstruction 

The wide tensors are registered into a common projective frame following [1]. Before, 
or after this is done, the wide tensors can (optionally) be complemented with some of 
the intermediate views. In the current implementation of the algorithm, this is done 
first. A wide tensor typically has two child tensors in the tensor tree. Each of these 
children provides an extra view and also some additional features. The extra views are 
inserted recursively, deeper and deeper in the tensor tree. One way to accomplish the 
insertion is to register the child tensors into the frame of the parent. However, all 
features and views spanned by a wide tensor can in fact be instantiated directly in a 
common frame. This avoids inconsistency problems and is done in a similar fashion as 
for sequential approaches [2,9]. The differences are that the process is now 
interpolative rather than extrapolative plus that reliable correspondences are already 
extracted. 

A tensor that provides the extra view to be filled in consists of the extra view plus 
two views that are already instantiated. The additional view is positioned between the 
old ones. Furthermore, all the new features that are provided have been found 
consistent with the trifocal tensor. Therefore all the new features have two 
observations in the already instantiated views. Thus they can be triangulated directly 
[12]. Once this is done, the new view can be determined from the features seen in it 
through another RANSAC process followed by optimisation. 

It should be remarked that once trifocal tensors spanning the video sequence are 
known, the camera matrices are in theory determined without further consideration of 
the features. However, the consistent use of reprojection error has been found crucial 
in practice. Furthermore, error accumulation is suppressed by bundle adjustment after 
every merge. 

In the process of intermediate view insertion, a quality assurance mechanism can be 
included that rejects the introduction of a new view and stops the recursion at the 
current depth if there are indications of failure. Depending on the application, it can 
also be desirable to limit the insertion to a certain number of steps. A wide tensor 
spanning many views typically indicates that there is little motion between the views 
of the original sequence. Thus, when it is desirable to homogenise the amount of 
motion between views, this can be accomplished by limiting the depth of insertion. 

Finally, the wide tensors with associated features are registered into a common 
projective frame. Also here, it is beneficial to use some kind of heuristic as to whether 
a merge should be accepted or not. In case of failure, it is better to discard the merge 
and produce several independent reconstructions, or to prompt for a manual merge. To 
build reliable automatic reconstruction systems, work on quality monitoring of this 
kind is essential. 

6 Results 

Experiments were performed on approximately 50 sequences. The projective 
reconstructions are specialised to Euclidean with the use of autocalibration. A fixed 
focal length, fixed principal point and zero skew are forced upon the model. The result 
is then bundle adjusted in the Euclidean setting. Some results are shown in Table 1. 



Table 1. Results from five sequences. The table is explained below 



Sequence            Frames    Views  Points  Lines  P_error      L_error        Figure 
Nissan Micra        1-28      17     679     93     0.73         0.007          - 
                    28-235    89     4931    170    1.11 (0.67)  0.039 (0.017)  - 
                    235-340   59     3022    226    0.79         0.024          6 
Flower Pot          1-115     61     3445    23     0.68         0.029          8 
                    122-180   43     3853    1      0.65         0.000          - 
                    180-267   83     10068   0      0.67         -              9 
Swedish Breakfast   1-64      41     3071    273    0.74         0.024          11 
                    64-249    125    8841    739    0.75         0.019          - 
                    249-297   29     749     240    0.68         0.025          - 
Bikes               1-161     103    8055    369    0.73         0.018          13 
David               1-19      11     688     38     0.52         0.020          15 
Shoe & Co           19-39     21     942     55     0.67         0.032          17 



All sequences in the table have a resolution of 352 x 288 pixels. The column 
‘Frames’ shows the frame span in the original sequence. Due to preprocessing, not all 
frames are used. The numbers of views, points and lines in the reconstruction are 
displayed in the columns ‘Views’, ‘Points’ and ‘Lines’. ‘P_error’ is the root mean 
square point reprojection error in pixels. ‘L_error’ is the root mean square 
line reprojection error. The line reprojection error is measured as the length of the 
vector $l - \hat{l}$, where $l$ and $\hat{l}$ are the observed and reprojected line, represented as 
homogeneous line vectors normalised to hit the unit cube. The column ‘Figure’ 
indicates in which figure the reconstruction is displayed graphically. Observe that the 
reprojection error of the second sub-sequence of ‘Nissan Micra’ is higher than the 
other results. This is due to a failure of the autocalibration: the camera trajectory is 
torn into two separate parts. The reprojection error in the projective frame is therefore 
shown in parentheses. 

The sequence ‘David’ was taken by the person in the sequence, stretching out an 
arm with the camera and moving it in an arc. The sequence ‘Shoe & Co’ is a ‘home 
made’ turntable sequence. It was taken in front of a refrigerator with the camera in a 
kitchen cupboard. The turntable is a revolving chair that was turned by hand with the 
camera running. The remaining sequences are handheld. Figures 5-17 display results 
graphically. In some of the figures it is difficult to interpret the structure. What can be 
seen in all of them, however, is that the extracted camera trajectory is good. This is the 
most important result. The algorithm is mainly intended for intrinsic and extrinsic 
camera calibration. In a complete system, it should be used together with a dense 
reconstruction algorithm that uses the calibration. It is not clear, though, that the dense 
reconstruction algorithm should necessarily use the structure. Dense reconstruction is 
part of the future work and out of the scope of this paper. 

Figure 7. Some frames from the sequence Flower Pot 

Figure 8. Two views of a reconstruction from the sequence Flower Pot 

Figure 9. View of a reconstruction from the sequence Flower Pot 

Figure 10. Some frames from the sequence Swedish Breakfast 

Figure 12. Some frames from the sequence Bikes 

Figure 13. View of a reconstruction from the sequence Bikes 

Figure 14. Some frames from the sequence David 

Figure 15. Two views of a reconstruction from the sequence David 

7 Conclusions 

A hierarchical computational structure of trifocal tensors has been described. It is used 
to solve for structure and motion in uncalibrated video sequences acquired with a 
handheld amateur camera. In agreement with this purpose, the algorithm was tested 
mainly on video material with different amounts of motion per frame, frames out of 
focus and relatively low resolution. With the presented computational structure, the 
instantiation of camera matrices and feature triangulation are postponed until disparity and 
parallax have been built up. The structure also integrates feature tracking and 
geometrical constraints into an iterative process. Experimental results have been 
shown in terms of sparse Euclidean reconstructions consisting of views, points and 
lines. The presented results are taken from experiments on approximately 50 
sequences. Future work includes more sophisticated ways to monitor the quality of the 
results, work on the autocalibration stage of the reconstruction and also the use of a 
method to derive a dense reconstruction. 

Abstract. There has been relatively little work shedding light on the 
effects of errors in the intrinsic parameters on motion estimation and 
scene reconstruction. Given that the estimation of the extrinsic and 
intrinsic parameters from uncalibrated motion is apt to be imprecise, it is 
important to study the resulting distortion in the recovered structure. 
By making use of the iso-distortion framework, we explicitly 
characterize the geometry of the distorted space recovered from 3-D motion 
with freely varying focal length. This characterization allows us: 1) to 
investigate the effectiveness of the visibility constraint in disambiguating 
uncalibrated motion by studying the negative distortion regions, and 2) 
to make explicit those ambiguous error situations under which the 
visibility constraint is not effective. An important finding is that under these 
ambiguous situations, the direction of heading can nevertheless be 
accurately recovered and the recovered structure experiences a well-behaved 
distortion. The distortion is given by a relief transformation which 
preserves ordinal depth relations. Thus in the case where the only unknown 
intrinsic parameter is the focal length, structure information in the form 
of depth relief can be obtained. Experiments are presented to support 
the use of the visibility constraint in obtaining such partial motion and 
structure solutions. 

Keywords: Structure from motion, Depth distortion, Space perception, 
Uncalibrated motion analysis. 



1 Introduction 



While there have been various works on the self-calibration problem, most face 
difficulties in estimating the intrinsic parameters accurately. Two courses are 
open to researchers. One approach is to enforce special camera displacements to 
obtain better estimates of the intrinsic parameters jllbll Illbl21)j . Another ap- 
proach argues that as far as scene reconstruction is concerned, several weaker 
structures (Projective, Affine) can be obtained without complete recovery of the 
intrinsic parameters. While the mainstay of the research efforts adopts the di- 
screte approach, |3f2[)j have recently formulated the problem in the continuous 



D. Vernon (Ed.): ECCV 2000, LNCS 1842, pp. 664- l?T77l 2000. 
(c) Springer- Verlag Berlin Heidelberg 2000 




Characterizing Depth Distortion Due to Calibration Uncertainty 665 



domain. Most of the schemes presented assume that the intrinsic parameters 
across the frames are constant. A more general treatment of the problem, allo- 
wing for varying intrinsic parameters, is given in |2I,SI1 Oil 2H 511 7I2()| . 

There has been relatively little work shedding light on the effects of errors in 
the intrinsic parameters on motion estimation and scene reconstruction. Florou 
and Mohr used the statistic approach to study reconstruction errors with res- 
pect to calibration parameters. Svoboda and Sturm m studied how uncertainty 
in the calibration parameters gets propagated to the motion parameters. Vieville 
and Faugeras studied the partial observability of rotational motion, calibration, 
and depth map in 1201. Bougnoux P] offered a critique of the self-calibration pro- 
blem, finding that the estimation of various intrinsic parameters are unstable. 
However, it was observed, partly empirically, that despite uncertainty in the focal 
length estimation, the quality of the reconstruction does not seem to be affec- 
ted. Certain geometrical properties such as parallelism seemed to be preserved. 
Aside from this observation, there has not really been an in-depth geometrical 
characterization of the errors in the reconstructed depth given some errors in 
both the intrinsic and the extrinsic parameters. 

In this paper, we consider the common situation where all of the intrinsic 
parameters are fixed except the focal length. The focal length can be freely 
varying across frames, resulting in a zoom field (considering infinitesimal motion) 
which is difficult to separate from that of a translation along the optical axis. 
This, together with the perennial problem of the coupling between translation 
and rotation, means that distortion in the recovered structure is likely to be 
present. 

This paper attempts to make the geometry of this distortion explicit by 
using the iso-distortion framework introduced in The motivation for perfor- 
ming this analysis is twofold: first, to seek to characterize the distortion in the 
perceived depth; second, to extend our understanding on how depth distortion in 
turn interacts with motion (including zoom) estimation. It is an alternative look 
at the problem of depth representation from the usual stratified viewpoint |S|, 
but one that will inform one another. 

This paper is structured along the following lines. First comes some prelimi- 
naries regarding the iso-distortion framework in Section 2, followed by an exten- 
sion of this framework to the self-calibration problem. Several major features of 
the resulting distortion are then made explicit. The main goals of Section 3 are 
(1) to elaborate the relations between the depth distortion and the estimation 
of both the intrinsic and extrinsic parameters; and (2) to study certain well- 
behaved depth distortion resulting from ambiguous solutions. Section 4 presents 
experiments to support the use of the visibility constraint in obtaining partial 
solutions to the estimation of both motion and structure. The paper ends with 
a summary of the work. 

2 Iso-Distortion Framework 

2.1 Pre-requisites 

If a camera is moving rigidly with respect to its 3D world with a translation 
$(U, V, W)$ and a rotation $(\alpha, \beta, \gamma)$, together with a zooming operation, the 
resulting optical flow $(u, v)$ at an image location $(x, y)$ can be extended from its basic 
form to include the following terms: 

$$u = \frac{W}{Z}(x - x_0) + \frac{xy}{f}\alpha - f\left(1 + \frac{x^2}{f^2}\right)\beta + \gamma y + \frac{\dot{f}}{f}x \qquad (1)$$

$$v = \frac{W}{Z}(y - y_0) - \frac{xy}{f}\beta + f\left(1 + \frac{y^2}{f^2}\right)\alpha - \gamma x + \frac{\dot{f}}{f}y \qquad (2)$$

where $f$ is the focal length of the camera; 

$\dot{f}$ is the rate of change of the focal length; 

$(x_0, y_0) = \left(f\frac{U}{W}, f\frac{V}{W}\right)$ is the Focus of Expansion (FOE) of the flow field; 

and $Z$ is the depth of the scene point. 
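
As a small numerical illustration of the two flow equations, the sketch below evaluates the flow at one image point. The function name `optical_flow` and the sample values are assumptions made for this example only. The translational term is written as (W x - f U)/Z, which equals (W/Z)(x - x_0) but stays defined when W = 0.

```python
def optical_flow(x, y, Z, U, V, W, alpha, beta, gamma, f, f_dot):
    """Flow (u, v) at image point (x, y) for depth Z, following Eqs. (1)-(2).

    (U, V, W): translation, (alpha, beta, gamma): rotation,
    f: focal length, f_dot: rate of change of the focal length.
    """
    zoom = f_dot / f
    # (W/Z)(x - x0) with x0 = f*U/W, rewritten to avoid dividing by W.
    u = ((W * x - f * U) / Z + (x * y / f) * alpha
         - f * (1 + x ** 2 / f ** 2) * beta + gamma * y + zoom * x)
    v = ((W * y - f * V) / Z - (x * y / f) * beta
         + f * (1 + y ** 2 / f ** 2) * alpha - gamma * x + zoom * y)
    return u, v


# Pure forward translation towards a point at depth 10 ...
forward = optical_flow(20.0, 10.0, 10.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 600.0, 0.0)
# ... and a pure zoom with f_dot / f = 0.1 and no camera motion.
zoomed = optical_flow(20.0, 10.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 600.0, 60.0)
```

Both calls give the same flow vector at this point, a simple instance of the coupling between translation along the optical axis and zoom that the paper sets out to analyse.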



2.2 Space Distortion Arising from Uncalibrated Motion 

In a recent work |^, the geometric laws under which the recovered scene is di- 
storted due to some errors in the viewing geometry is represented by a distortion 
transformation. It was called the iso-distortion framework whereby distortion in 
the perceived space can be visualized by families of iso-distortion lines. In the 
present study, this framework has been extended to characterize the types of 
distortion experienced by a visual system where a change in the focal length 
may result in further difficulties and errors in the estimation of its calibration. 

From the well-known motion equation, the relative depth of a scene point 
recovered using normal flow with direction $(n_x, n_y)$ may be represented by 

$$Z = \frac{(x - x_0, y - y_0)\cdot(n_x, n_y)}{u_n - (\mathbf{u}_r + \mathbf{u}_f)\cdot(n_x, n_y)} \qquad (3)$$

where $u_n$ is the normal flow magnitude; 

$\mathbf{u}_r$ is the rotational flow; and 

$\mathbf{u}_f$ is the zoom flow caused by a change in the focal length. 



If there are some errors in the estimation of the intrinsic or the extrinsic 
parameters, this will in turn cause errors in the estimation of the scaled depth, and 
thus a distorted version of space will be computed. By representing the 
estimated parameters with the hat symbol (^) and errors in the estimated parameters 
with the subscript e (where the error of any estimate $\hat{p}$ is defined as $p_e = p - \hat{p}$), the 
estimated relative depth $\hat{Z}$ may be expressed in terms of the actual depth $Z$ as 
follows: 

$$\hat{Z} = Z\left(\frac{(x - \hat{x}_0)n_x + (y - \hat{y}_0)n_y}{(x - x_0, y - y_0)\cdot(n_x, n_y) + \mathbf{u}_{re}\cdot(n_x, n_y)Z + \mathbf{u}_{fe}\cdot(n_x, n_y)Z + NZ}\right) \qquad (4)$$

where $\mathbf{u}_{re} = \mathbf{u}_r - \hat{\mathbf{u}}_r$ 

$= \left(\frac{xy}{f}(\alpha - \hat{\alpha}) - f\left(1 + \frac{x^2}{f^2}\right)(\beta - \hat{\beta}) + y(\gamma - \hat{\gamma}),\; -\frac{xy}{f}(\beta - \hat{\beta}) + f\left(1 + \frac{y^2}{f^2}\right)(\alpha - \hat{\alpha}) - x(\gamma - \hat{\gamma})\right)$; 

$\mathbf{u}_{fe} = \mathbf{u}_f - \hat{\mathbf{u}}_f = (\delta_e x, \delta_e y)$; and 

$N$ is a noise term representing error in the estimate for the normal flow value. 

Equation (4) shows that errors in the motion estimates distort the recovered 
relative depth by a factor $D$, given by the terms in the bracket: 

$$D = \frac{(x - \hat{x}_0)n_x + (y - \hat{y}_0)n_y}{(x - x_0, y - y_0)\cdot(n_x, n_y) + \mathbf{u}_{re}\cdot(n_x, n_y)Z + \mathbf{u}_{fe}\cdot(n_x, n_y)Z + NZ} \qquad (5)$$

Equation (5) describes, for any fixed direction $(n_x, n_y)$ and any fixed distortion 
factor $D$, a surface $f(x, y, Z) = 0$ in $xyZ$-space, which has been called the 
iso-distortion surface. For specific values of the parameters $x_0$, $y_0$, $\hat{x}_0$, $\hat{y}_0$, $\alpha_e$, $\beta_e$, 
$\gamma_e$, $f$, $\dot{f}$, $\hat{\dot{f}}$ and $(n_x, n_y)$, this iso-distortion surface has the obvious property 
that points lying on it are distorted in depth by the same multiplicative factor 
$D$. The distortion of the estimated space can be studied by looking at these 
iso-distortion surfaces. In order to present these analyses visually, most of 
the investigation will be conducted by initially considering $(n_x, n_y)$ to be in 
the horizontal direction. Ignoring the noise term $N$, we get the following set of 
equations for different values of $D$: 



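$$\begin{cases} x = \hat{x}_0 & \text{if } D = 0 \\[4pt] Z = \dfrac{x_0 - x}{u_{re}^x + u_{fe}^x} & \text{if } D \to \pm\infty \\[4pt] Z = \dfrac{1 - D}{D}\left(\dfrac{x - x_0}{u_{re}^x + u_{fe}^x}\right) + \dfrac{1}{D}\left(\dfrac{x_{0e}}{u_{re}^x + u_{fe}^x}\right) & \text{otherwise} \end{cases} \qquad (6)$$

where the superscript $x$ indicates the projection of the vector onto the horizontal 
direction. 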



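Now, if we were to consider the field of view of the camera to be small and 
ignore the effect of $\gamma_e$ so that $\mathbf{u}_{re}$ becomes $(-\beta_e f, \alpha_e f)$, we have 

$$\begin{cases} x = \hat{x}_0 & \text{if } D = 0 \\[4pt] Z = \dfrac{x_0 - x}{-f\beta_e + \delta_e x} & \text{if } D \to \pm\infty \\[4pt] Z = \dfrac{1 - D}{D}\left(\dfrac{x - x_0}{-f\beta_e + \delta_e x}\right) + \dfrac{1}{D}\left(\dfrac{x_{0e}}{-f\beta_e + \delta_e x}\right) & \text{otherwise} \end{cases} \qquad (7)$$

where $\delta_e = \dfrac{\dot{f}}{f} - \dfrac{\hat{\dot{f}}}{\hat{f}}$. 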



These equations describe the iso-distortion surfaces as a set of surfaces 
perpendicular to the x-Z plane. Much of the information that these equations contain 
can thus be visualized by considering a family of iso-distortion contours on a 
two-dimensional x-Z plane. Each family is defined by four parameters: $x_0$ and 
the three error terms $x_{0e}$, $\beta_e$ and $\delta_e$. Within each family, a particular $D$ defines 
an iso-distortion contour. Figure 1 shows two particular cases of the 
iso-distortion contours. In the next subsection, we shall determine the salient 
geometrical properties of the iso-distortion contours. 





(a) (b) 

Fig. 1. Iso-distortion contours for (a) $x_0 = 30$, $x_{0e} = 50$, $\beta_e = 0.001$, $\delta_e = 0.05$ and 
$f = 350$. (b) When $\delta_e = -0.05$. (Shaded region corresponds to the negative depth 
region.) 



2.3 Salient Properties 

Several salient features can be identified from the plot. 

1) The $D = 0$ curve is a vertical line that intersects the x-axis at $\hat{x}_0$. Any 
change in the estimated FOE will slide this line along the x-axis. 

2) The $D = \pm\infty$ curve intersects the x-axis at $x_0$ and approaches $Z = -\frac{1}{\delta_e}$ as 
$x$ tends to infinity. The structure of the $D = \pm\infty$ contour is independent of 
the position of the estimated FOE but depends on the true FOE location. 

3) The contours intersect at a singular point where $x = \hat{x}_0$ and $Z = \frac{x_{0e}}{-f\beta_e + \delta_e \hat{x}_0}$. 
At this point, the depth of the scene is undefined. 

4) The vertical asymptote for all contours where $D \neq 0$ is $x = \frac{f\beta_e}{\delta_e}$. 

5) The horizontal asymptote for each contour where $D \neq 0$ is the line $Z = \frac{1 - D}{D\delta_e}$. 
Hence, each contour has a different horizontal asymptote depending on 
the value of $D$. However, the horizontal asymptote for the contour $D = 1$ is 
always the x-axis, independent of other parameters. A diminishing value of 
$\delta_e$ simultaneously moves the horizontal and vertical asymptotes away from 
the image center. It approaches the iso-distortion configuration under the 
case of calibrated motion ^ . 
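
These properties can be checked numerically. The sketch below assumes the small field-of-view contour equations for horizontal normal flow, with errors defined as true value minus estimate, and uses the parameter values of Fig. 1(a), so that the estimated FOE is at x = 30 - 50 = -20. The function names are hypothetical.

```python
def distortion_factor(x, Z, x0, x0_hat, beta_e, delta_e, f):
    """Distortion factor D at (x, Z) for horizontal normal flow, noise ignored."""
    return (x - x0_hat) / ((x - x0) + (-f * beta_e + delta_e * x) * Z)


def contour_depth(x, D, x0, x0_hat, beta_e, delta_e, f):
    """Depth Z at image position x on the iso-distortion contour of factor D."""
    x0e = x0 - x0_hat  # FOE error, true minus estimated
    return ((1.0 - D) * (x - x0) + x0e) / (D * (-f * beta_e + delta_e * x))


# Parameter values taken from the caption of Fig. 1(a).
fig1a = dict(x0=30.0, x0_hat=-20.0, beta_e=0.001, delta_e=0.05, f=350.0)

# Property 3: every contour passes through the same point at x = x0_hat.
singular = [contour_depth(-20.0, D, **fig1a) for D in (0.5, 1.0, 2.0, -3.0)]

# Property 5: far from the image center the D = 2 contour flattens out.
far = contour_depth(1e9, 2.0, **fig1a)

# Round trip: a point placed on the D = 0.5 contour has distortion 0.5.
z_on_contour = contour_depth(100.0, 0.5, **fig1a)
d_back = distortion_factor(100.0, z_on_contour, **fig1a)
```

With these values the vertical asymptote lies at x = 350 * 0.001 / 0.05 = 7, and the D = 2 contour tends to Z = (1 - 2) / (2 * 0.05) = -10.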

3 What Can the Distortion Contours Tell Us? 

3.1 Confusion between Translation, Rotation, and Zoom 

It is well known that we cannot numerically resolve the ambiguity between several 
intrinsic and extrinsic parameters. For instance, there is a strong ambiguity 
between a translation along the optical axis and a zoom. In this subsection, we 
briefly look at the numerical aspect of this problem before studying how the 
iso-distortion framework can be utilized to yield useful information. 

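One straightforward way of estimating the motion parameters would be to 
hypothesize a FOE. Flow vectors are then projected onto the gradient direction $\mathbf{n}^i$ 
perpendicular to the lines emanating from the hypothesized FOE so that they 
contain only the rotational and the zoom field (if the hypothesized FOE were 
correct). The least square formulation is as follows: 

$$\begin{bmatrix} \left(\frac{x^i y^i}{f},\, f\left(1 + \frac{(y^i)^2}{f^2}\right)\right)\cdot\mathbf{n}^i & \left(-f\left(1 + \frac{(x^i)^2}{f^2}\right),\, -\frac{x^i y^i}{f}\right)\cdot\mathbf{n}^i & (y^i, -x^i)\cdot\mathbf{n}^i & (x^i, y^i)\cdot\mathbf{n}^i \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \\ \gamma \\ \dot{f}/f \end{bmatrix} = (u^i, v^i)\cdot\mathbf{n}^i \qquad (8)$$
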

The conjecture is that, for the correct FOE candidate, the residual should 
be the smallest among all candidates, since the least square equations correctly 
model the situation. Thus, the least square residual furnishes a feasible measure 
upon which to base the FOE candidate selection. However, this formulation has 
several problems. Consider the true FOE candidate: as far as this candidate is 
concerned, the last column of the matrix in Equation (8) can be rewritten 
as $(x_0, y_0)\cdot\mathbf{n}^i$ since in this case $(x^i, y^i)\cdot\mathbf{n}^i = (x_0, y_0)\cdot\mathbf{n}^i$, $\forall i$. Thus, when the 
FOE is at the image center (i.e. $(x_0, y_0) = (0, 0)$), the least square estimation in 
Equation (8) becomes rank deficient and this gives an infinite set of solutions. 
Hence, from the least-square fitting formulation, we are not able to separate a 
pure forward translation from a zoom. Now, if the FOE lies near to the image 
center, the least square estimation becomes unstable, which can be analyzed by 
using the concept of condition number. Similar conclusions can be derived for 
various other algorithms, for instance that of Brooks 0. 
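
The rank deficiency can be reproduced with a short NumPy sketch. It builds one row of the least square system per image point, with unknowns ordered as rotation about the three axes followed by the zoom rate, and compares the rank for a FOE at the image center with that for an off-center FOE. The function name `design_matrix`, the random point set and the tolerance are assumptions made for this illustration.

```python
import numpy as np


def design_matrix(points, foe, f):
    """One row per point: coefficients of (alpha, beta, gamma, f_dot/f)."""
    rows = []
    for x, y in points:
        dx, dy = x - foe[0], y - foe[1]
        # Unit vector perpendicular to the line joining the FOE and the point.
        n = np.array([-dy, dx]) / np.hypot(dx, dy)
        rows.append([
            np.array([x * y / f, f * (1 + y ** 2 / f ** 2)]) @ n,
            np.array([-f * (1 + x ** 2 / f ** 2), -x * y / f]) @ n,
            np.array([y, -x]) @ n,
            np.array([x, y]) @ n,  # zoom column, equals foe . n
        ])
    return np.array(rows)


rng = np.random.default_rng(0)
points = rng.uniform(-150.0, 150.0, size=(40, 2))  # arbitrary image points

rank_center = np.linalg.matrix_rank(design_matrix(points, (0.0, 0.0), 600.0), tol=1e-8)
rank_offset = np.linalg.matrix_rank(design_matrix(points, (50.0, 30.0), 600.0), tol=1e-8)
```

With the FOE at the center the zoom column vanishes and the rank drops to 3, so the zoom rate cannot be determined from these equations. For the off-center FOE the matrix has full rank 4, although it remains poorly conditioned when the FOE is close to the center.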

Thus, the least-square fitting residual cannot be the final criterion for choosing 
the correct FOE. Rather, we have to look for an additional constraint to prune the 
set of possible solution candidates. 

3.2 The Visibility Constraint 

Direct motion algorithms 1 1 ,'ilYj often attempt to find the solution by minimizing 
the number of negative depth found. This is known as the visibility constraint. 
However, its usage in the estimation of uncalibrated motion is relatively 
unexplored. We would now like to examine this constraint in the light of the negative 
distortion region. The geometry of the negative distortion region allows us to 
examine these questions: does the veridical solution have the minimum number 
of negative depth estimates? Are there combinations of estimation errors such that the 
visibility constraint is not sufficient to discriminate between alternative solutions? 
Do these ambiguous solutions exhibit any peculiar properties in terms of their 
recovered structure or their motion estimates? To study these questions, we 
consider the distortion plot for the horizontal flow first, before considering those in 
other directions. 



3.3 Constraints on Motion Errors 

Comparing Figures 1(a) and 1(b), the first observation is that if a particular 
solution has a negative $\delta_e$, the distortion plot will be such that the majority of 
the negative distortion regions lie in front of the image plane. Irrespective of the 
actual scene structure, a large number of negative depth estimates will be 
obtained, thereby ruling out that particular solution. Thus the first condition 
on the motion errors for ambiguity to arise is that the zoom error flow must be 
such that: 

$$\delta_e > 0 \qquad (9)$$

in which case most of the negative distortion region lies behind the image plane. 
What remains in front of the image plane is a band of negative distortion region, 
bounded by two contours, the $D = 0$ and the $D = -\infty$ contours, whose equations 
are respectively $x = \hat{x}_0$ and $Z = \frac{x_0 - x}{-f\beta_e + \delta_e x}$. The latter cuts the horizontal axis 
at $x = x_0$ and its vertical asymptote is given by $x = \frac{f\beta_e}{\delta_e}$. We now derive 
the combination of errors such that this negative band will be minimized (i.e. 
ambiguity is maximized). 

To derive these combinations, we first arbitrarily fix the error $\delta_e$ and suppose 
it satisfies the constraint given in (9). The constraints on the other parameters 
$\hat{\beta}$ and $\hat{x}_0$ that will yield minimum negative distortion region depend on whether 
an algorithm solves for these parameters separately or simultaneously: 

1) If $\beta$ is solved first and the estimate contains an error $\beta_e$, then the $\hat{x}_0$ that 
minimizes the negative depth region, given these fixed $\delta_e$ and $\beta_e$, is: 

$$\hat{x}_0 = \frac{2x_0 + (Z_{max} + Z_{min}) f\beta_e}{(Z_{max} + Z_{min})\delta_e + 2}$$

where we have assumed that depths in the scene are uniformly distributed 
between $Z_{min}$ and $Z_{max}$. See Figure 2(a). This $\hat{x}_0$ always lies between $x_0$ 
and the vertical asymptote. A similar condition on $\beta_e$ can be derived if we 
solve for $\hat{x}_0$ first. 
(a) (b) 

Fig. 2. Configuration for Minimum Negative Depth. (a) Fixed $\delta_e$ and $\beta_e$. (b) Fixed $\delta_e$. 



2) If both $\beta$ and $\hat{x}_0$ are solved together, then the solution that minimizes the 
negative depth region is given by 

$$\hat{x}_0 = x_0 = \frac{f\beta_e}{\delta_e}$$

that is, the x-component of the direction of heading is recovered veridically, 
and the vertical asymptote $x = \frac{f\beta_e}{\delta_e}$ coincides with the line $x = \hat{x}_0 = x_0$ 
so that the $D = 0$ and the $D = -\infty$ contours coincide. In this case, the 
negative band in front of the image plane vanishes. See Figure 2(b). 

3) Furthermore, if the lines $x = \hat{x}_0$, $x = x_0$ and $x = \frac{f\beta_e}{\delta_e}$ are out of the image 
(and on the same side), then even if they do not meet, the negative distortion 
band would be outside the field of view. Thus, this solution will not yield 
any negative depth estimates and would be totally ambiguous too. 
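
The closed form in case 1) can be cross-checked by brute force. The sketch below is our own check rather than part of the original analysis: it assumes that the size of the negative band is measured as the mean horizontal distance between the line at the estimated FOE and the D = -infinity contour, averaged over depths spread uniformly between the two depth limits. All function names and numerical values are illustrative.

```python
import numpy as np


def band_boundary(Z, x0, beta_e, delta_e, f):
    """x-position of the D = -infinity contour at depth Z.

    Obtained by solving Z = (x0 - x) / (-f*beta_e + delta_e*x) for x.
    """
    return (x0 + Z * f * beta_e) / (1.0 + Z * delta_e)


def closed_form_foe(x0, beta_e, delta_e, f, z_min, z_max):
    """Estimated FOE that minimizes the band under the uniform-depth assumption."""
    s = z_min + z_max
    return (2.0 * x0 + s * f * beta_e) / (2.0 + s * delta_e)


def mean_band_width(x0_hat, depths, x0, beta_e, delta_e, f):
    """Mean horizontal width of the negative band over the sampled depths."""
    return float(np.mean(np.abs(x0_hat - band_boundary(depths, x0, beta_e, delta_e, f))))


params = dict(x0=30.0, beta_e=0.001, delta_e=0.05, f=350.0)
depths = np.linspace(10.0, 100.0, 2001)      # uniform depths, Zmin=10, Zmax=100
candidates = np.linspace(10.0, 23.0, 2601)   # candidate FOE estimates, step 0.005
widths = [mean_band_width(c, depths, **params) for c in candidates]
best = float(candidates[int(np.argmin(widths))])
closed = closed_form_foe(z_min=10.0, z_max=100.0, **params)
```

The minimizer of a mean absolute deviation is the median, and because the contour position varies monotonically with depth, the median position is the contour position at the mid-depth, which gives the closed form. The grid search agrees with it to within the grid spacing.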



To both summarize and complete the analysis, we consider flows in any other 
gradient direction. The preceding conditions can be generalized as: 

$$(\hat{x}_0, \hat{y}_0) = (x_0, y_0) = \left(\frac{f\beta_e}{\delta_e}, -\frac{f\alpha_e}{\delta_e}\right), \quad \delta_e > 0 \qquad (10)$$

that is, a solution which correctly estimates the direction of heading. $(\beta_e, -\alpha_e)$ must be 
in the same direction as $(x_0, y_0)$ so that for any gradient direction $(n_x, n_y)$, the 
condition in 2) is satisfied for the same $\delta_e$. In this case, 
the negative distortion region vanishes in front of the image plane for all gradient 
directions. 



3.4 Distortion of Recovered Structure 

The preceding analysis shows that the use of the visibility constraint does not
lift the ambiguities that exist among various kinds of motions. However, it does
restrict the solution set so that those yielding the minimum negative depth
estimates possess certain nice properties, such as the direction of heading being
correctly estimated. Furthermore, as can be seen from Figure 2, the iso-distortion
contours become horizontal, resulting in well-behaved distortion. Indeed, the
distortion factor D in this case has the form of a relief transformation, where
$a = 1$ and $b = \delta_e$. This relief transformation preserves the ordering of points;
its general properties have recently been discussed and analyzed elsewhere. As a
result of this distortion, the reconstructed scene may appear visually perfect even
though the depths have been squashed to various degrees. It is of interest to compare
this result with that demonstrated by Bougnoux: that the uncertainty on the
focal length estimation leads to a Euclidean calibration up to a quasi-anisotropic
homothety, which in turn yields a visually good-looking reconstruction.



4 Experiments 

This section presents the experiments carried out to support the theoretical 
findings established in the preceding section. Specifically, we demonstrate the 
ability to correctly estimate the heading direction of the camera based on mi- 
nimizing the number of negative depth estimates. The distortion effects due to 
erroneous motion estimates on simple surfaces were also tested. In our experi- 
ments, both synthetic and real images were used. 



4.1 Synthetic Images 

A set of noise-free synthetic images of dimension 240 pixels by 320 pixels was
generated. The focal length of the projection was fixed at 600 pixels, which gave
a viewing angle of nearly 30°. Three simple planes with different orientations at
different 3-D depths were constructed in the image. The true FOE was located
at (65,0) of the image plane and the rotational parameters $(\alpha, \beta, \gamma)$ had the
values (0, -0.00025, 0). There was no change in the focal length (i.e. $\dot{f} = 0$).
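The quoted viewing angle can be checked with the usual pinhole relation between image extent and focal length. The sketch below is only a sanity check; it assumes that the 320-pixel side is the horizontal extent of the image, and the function name `field_of_view_deg` is introduced here for illustration.

```python
import math


def field_of_view_deg(extent_px: float, focal_px: float) -> float:
    """Full viewing angle, in degrees, subtended by extent_px pixels at focal length focal_px."""
    return math.degrees(2.0 * math.atan((extent_px / 2.0) / focal_px))


# Synthetic setup: 320-pixel extent (assumed horizontal), 600-pixel focal length.
print(field_of_view_deg(320, 600))
```

The same function applied to the 326-pixel extent and 620-pixel focal length of the real sequence also gives an angle slightly below 30°.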

We first arbitrarily fixed the error $\delta_e$ to be some positive number. We then
solved for the rotational parameter ($\beta$) and the FOE ($x_0$) in the following
manner: for each hypothesized $\beta$, we selected the best $\hat{x}_0$ candidate such that the
minimum number of negative depth estimates was obtained. The search range
for $\beta$ was from -0.005 to 0.005. The top ten candidates with the fewest
negative depth estimates are tabulated in Table 1.
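The search procedure just described can be sketched as a small grid search. The code below is a toy illustration, not the implementation used in the experiments: it assumes a simplified one-dimensional flow model u = (x - x0) w + f beta + delta x, with w the scaled inverse depth, and it uses no focal length error. All names (`count_negative_depths`, `rank_candidates`, the toy scene values) are hypothetical.

```python
def count_negative_depths(xs, flows, x0_hat, beta_hat, delta_hat, f):
    """Count image points whose recovered inverse depth is negative (toy 1-D model)."""
    negatives = 0
    for x, u in zip(xs, flows):
        denom = x - x0_hat
        if denom == 0:
            continue  # depth is undefined exactly on the line x = x0_hat
        if (u - f * beta_hat - delta_hat * x) / denom < 0:
            negatives += 1
    return negatives


def rank_candidates(xs, flows, beta_grid, x0_grid, delta_hat, f):
    """For each hypothesized beta keep the best FOE, then rank by negative-depth count."""
    ranked = []
    for beta_hat in beta_grid:
        scored = [(count_negative_depths(xs, flows, c, beta_hat, delta_hat, f), c)
                  for c in x0_grid]
        count, x0_hat = min(scored)  # fewest negatives; ties go to the smaller x0_hat
        ranked.append((count, beta_hat, x0_hat))
    ranked.sort()
    return ranked


# Toy scene (assumed values): half-integer pixel columns, three inverse-depth levels.
F_PIX, X0_TRUE, BETA_TRUE = 600.0, 65, -0.00025
xs = [k + 0.5 for k in range(-160, 160)]
inv_depth = [0.002 + 0.001 * (i % 3) for i in range(len(xs))]
flows = [(x - X0_TRUE) * w + F_PIX * BETA_TRUE for x, w in zip(xs, inv_depth)]

beta_grid = [-0.0005, -0.00025, 0.0, 0.00025]
x0_grid = list(range(55, 76))
ranking = rank_candidates(xs, flows, beta_grid, x0_grid, delta_hat=0.0, f=F_PIX)
print(ranking[0])
```

In this noise-free toy setting the candidate with the true rotation and the true FOE is the only one that produces no negative depth at all, which mirrors the selection criterion used above.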

In this experiment, all ten candidates gave no negative depth estimates. One
possible explanation is that there is a lack of depth variation in the synthetic
image (the motion flow ranges from 0.000167 to 1.170667). Another
observation is that the selected $\hat{x}_0$ for any $\beta$ always resides between the vertical
asymptote and the true FOE $x_0$.

Figure 3(a) depicts the variation of the percentage of negative depth
estimates as a function of the estimated $\hat{x}_0$ when the vertical asymptote is at the
veridical FOE position. The figure corroborates our theoretical prediction that
the least amount of negative depth estimates (in this case 0) is obtained when
$\hat{x}_0 = x_0$. Similar curves are obtained when the vertical asymptote is not at
the veridical FOE position. In theory, if there were sufficient feature points, the
minimum points of these curves should be higher than those of the former.

Using the erroneous motion estimates $\hat{x}_0$, $\hat{\beta}$ and $\delta_e$ that resulted in the
least amount of negative depth estimates, we attempted to reconstruct the
synthetic planes. Figure 3(b) shows the plan view of the three synthetic planes,
together with their reconstructed versions. It can be seen that the relief of the
planes remained unchanged after the transformation, i.e., the ordinal depths were
preserved. Note that the metric aspect of the plane orientations (their slants)
was altered. This change can be related to the calibration uncertainties via the 
complex rational function given in Equation 0 - 



Table 1. Top ten candidates in the synthetic image experiment, $x_0 = 65$.

Vertical asymptote   58  59  60  61  62  63  64  65  66  67
$\hat{x}_0$          63  64  64  64  64  65  65  65  65  66




Fig. 3. (a) Variation of the amount of negative depth estimates as a function of $\hat{x}_0$
for synthetic images. (b) Synthetic planes (A, B and C) with their reconstructed
versions (dotted lines).



4.2 Real Images 

For real images, we used a sequence whose dimensions had been scaled down to
287 pixels by 326 pixels. The focal length and the field of view of the camera
were respectively 620 pixels and approximately 30°. The rotational parameters
$(\alpha, \beta, \gamma)$ were (-0.00013, -0.00025, 0) and $\dot{f} = 0$. The true FOE was at image
location (65,73). A similar experimental procedure to the synthetic experiment was
applied to the real images. The top 10 candidates obtained are shown in Table 2.

When the vertical asymptote was at 64, the variation of the percentage of
negative depths obtained as a function of the estimated $\hat{x}_0$ is plotted in
Figure 4(a). We obtained a similar curve to the synthetic case. However, due to
the presence of noise, the minimum amount of negative depth obtained was not
zero and the best $\hat{x}_0$ was located at 64 (about 1 pixel from the true $x_0$). The
procedure was repeated to locate the y-component of the FOE using the vertical
component of the normal flow vectors. The positions of the true and estimated
FOE are plotted in Figure 4(b).



Table 2. Top ten candidates for the real image experiment, $x_0 = 65$.

Vertical asymptote   62     61     63     59     60     64     45     75     67     76
$\hat{x}_0$          64     63     65     61     62     66     46     78     69     79
NegDepth (%)         0.097  0.110  0.117  0.130  0.145  0.147  0.152  0.154  0.157  0.159




Fig. 4. (a) Variation of negative depth as a function of $\hat{x}_0$ for real images. (b) Positions
of true and estimated FOE in the real image (+: true FOE at (65,73); ×: estimated
FOE at (64,74)).



4.3 Discussions 

The results obtained seem to corroborate the various predictions made in this
paper. In particular, while the visibility constraint cannot effect a full recovery
of all the parameters, minimizing the number of negative depth estimates does
result in certain nice properties of the solutions. It seems that, at least in the
case where the only unknown intrinsic parameter is the focal length, structure
information in the form of depth relief can be obtained from the motion cue.
The reconstructed depths did look visually acceptable due to the preservation
of the depth relief. The results also validate the assumption made in this paper
(that quadratic terms in the flow can be discarded), at least for fields of view up
to 30°.

There are many problems that plague real images. Foremost among these is
the presence of noise. The effect of any noise N at a particular image pixel is
to replace the term $f\beta_e$ in the numerator of the vertical asymptote by
$f\beta_e - N$. Thus, for this particular image point, the effective vertical asymptote
has shifted, and part of the problem is that this shift has different effects
on the various solution candidates. For those solutions where the
negative distortion region in front of the image plane would have vanished under
noiseless conditions, this noise-induced shift away from $x = x_0$ may result in
that particular depth estimate becoming negative again (depending on the sign
and magnitude of that N). For other solutions, this shift may have
the contrary effect of moving that point out of the negative distortion region. It
becomes plausible that the "desired" solutions (i.e. those satisfying (10)) may
not have the minimum number of negative depth estimates. Thus, the overall
effect of noise is to reduce the effectiveness of the visibility constraint in getting
the "desired" solutions. In our experiment, we found that as long as $f\beta_e$, the
error in the rotational flow, is large enough so that $f\beta_e - N \approx f\beta_e$, the location
of the FOE $(x_0, y_0)$ could be determined quite accurately.

Other confounding factors for real images include the sparse distribution
of scene features. While the underlying negative distortion region
may have increased in size, there may not be any increase in the number of
negative depth estimates, due to a lack of scene points residing in the negative
distortion regions. Evidently, under such circumstances, the number of negative
depth estimates may not exhibit a monotonic increase as the error in the FOE
increases. Inspecting the top candidates selected in Table 2, we observed that
this is indeed the case as $\hat{x}_0$ moves away from $x_0$.

5 Conclusions and Future Directions 

This paper represents a first look at the distortion in the perceived space resul- 
ting from errors in the estimates for uncalibrated motion. The geometry of the 
negative distortion region allows us to answer questions such as whether the vi- 
sibility constraint is adequate for resolving ambiguity. It is also found that while 
Euclidean reconstruction is difficult, the resulting distortion in the structure 
satisfies the relief transformation, which means that ordinal depth is preserved. 

A concluding caveat is in order concerning real zoom lens operation. The prin- 
cipal point often changes when the focal length varies. Hence, an analysis based 
on a more detailed distortion model should be carried out. Our iso-distortion mo- 
del can be readily extended to take into account these changes and this would 
be our future work. 

To close this paper, we remark that there are many potential
applications of the results of our research to areas like multimedia video indexing,
searching and browsing, where it is common practice to use zoom lenses. It is
desirable to incorporate partial scene understanding capabilities under freely
varying focal length, yet without having to go through elaborate egomotion
estimation to obtain the scene information. The conclusion of this paper is that
while it is very difficult to extract metric scene descriptions from video input,
qualitative representations based on ordinal representation constitute a viable
avenue.

Abstract. We describe in this paper closed-form solutions to the following
problems in multi-view geometry of n'th order curves: (i) recovery
of the fundamental matrix from 4 or more conic matches in two views,
(ii) recovery of the homography matrix from a single n'th order ($n \geq 3$)
matching curve and, in turn, recovery of the fundamental matrix from
two matching n'th order planar curves, and (iii) 3D reconstruction of a
planar algebraic curve from two views.

Although some of these problems, notably (i) and (iii), were introduced
in the past [1 .113], our derivations are analytic with resulting closed-form
solutions. We have also conducted synthetic experiments on (i) and real
image experiments on (ii) and (iii) with subpixel performance levels, thus
demonstrating the practical use of our results.



1 Introduction 



A large body of research has been devoted to the problem of computing the
epipolar geometry from point correspondences. The theory of the fundamental
matrix and its robust numerical computation from point correspondences are well
understood. The next natural step has been to address the problem of
line or point-line correspondences. It has been shown that in that case three views
are necessary to obtain constraints on the viewing geometry.

Since scenes rich with man-made objects contain curve-like features, the next
natural step has been to consider higher-order curves. Given known projection
matrices (or fundamental matrix and trifocal tensor), earlier work shows how to
recover the 3D position of a conic section from two and three views, how to
recover the homography matrix of the conic plane, and how to recover a quadric
surface from projections of its occluding conics. Reconstruction of higher-order
curves has been addressed in two lines of work. In the first,
the matching curves are represented parametrically, where the goal is to find a
re-parameterization of each matching curve such that in the new parameterization
the points traced on each curve are matching points. The optimization is
over a discrete parameterization; thus, for a planar curve of degree n, which is
represented by $\frac{1}{2}n(n+3)$ points, one would need a minimum of $n(n+3)$
parameters to solve for in a non-linear bundle adjustment machinery, with
some prior knowledge of a good initial guess. In the second, the reconstruction is done
under an infinitesimal motion assumption with the computation of spatio-temporal
derivatives that minimize a set of non-linear equations at many different points
along the curve. Finally, there have also been attempts to recover the
fundamental matrix from matching conics, with the result that 4 matching conics
are minimally necessary for a unique solution, although the result is obtained
by using a computer algebra system. The method developed there is specific to
conics and is thus difficult to generalize to higher-order curves.

In this paper we treat the problems of recovering fundamental matrix, homo- 
graphy matrix, and 3D reconstruction (given fundamental matrix) using mat- 
ching curves (represented in implicit form) of n’th order arising from planar n’th 
order curves. The emphasis in our approach is to produce closed form solutions. 
Specifically, we show the following three results: 

1. We revisit the problem of recovering the fundamental matrix from matching
conics and re-prove, this time analytically, the result that 4 matching
conics are necessary for a unique solution. We show that the equations
necessary for proving this result are essentially Kruppa's equations, which
are well known in the context of self-calibration.

2. We show that the homography matrix of the plane of an algebraic curve of
n'th order ($n \geq 3$) can be uniquely recovered from the projections of the
curve, i.e., a single curve match between two images is sufficient for solving
for the associated homography matrix. Our approach relies on inflection and
singular points of the matching curves; the resulting procedure is simple
and closed-form.

3. We derive simple algorithms for reconstructing a planar algebraic curve
of n'th order from its projections. The algorithms are closed-form, where the
most "complicated" stage is finding the roots of a univariate polynomial.

We have conducted synthetic experiments on recovery of the fundamental matrix
from matching conics, and real imagery experiments on recovering the homography
from a single matching curve of 3rd order, and reconstruction of a 4th order
curve from two views. The latter two experiments display subpixel performance
levels, thus demonstrating the practical use of our results.



2 Background 

Our algorithms are valid for planar algebraic curves. We start by presenting 
an elementary introduction to algebraic curves, and then some introductory 
properties about two images of the same planar curve useful for the rest of our 
work. More material can be found in HH. 



2.1 Planar Algebraic Curves 

We assume that the image plane is embedded into a projective plane and
that the ground field is the field of complex numbers. This makes the formulation
simpler, but eventually we take into account only the real solutions.

Definition 1 Planar algebraic curve

A planar algebraic curve C is a subset of points whose projective coordinates
satisfy a homogeneous polynomial equation: $f(x, y, z) = 0$. The degree of f is
called the order of C. The curve is said to be irreducible when the polynomial f
cannot be divided by a non-constant polynomial.

We assume that all the curves we are dealing with are planar irreducible
algebraic curves. Note that when two polynomials define the same irreducible
curve, they must be equal up to scale. For convenience and shorter formulation,
we define a form $f \in \mathbb{C}[x, y, z]$ of degree n to be a homogeneous polynomial in
x, y, z of total degree n.

Let C be a curve of order n and let L be a given line. We can represent the
line parametrically by taking two fixed points a and b on it, so that a general
point p (except b itself) on it is given by $a + \lambda b$. The intersections of L and C
are the points $\{p_\lambda\}$ such that the parameters $\lambda$ satisfy the equation:

$$J(\lambda) = f(a_x + \lambda b_x,\ a_y + \lambda b_y,\ a_z + \lambda b_z) = 0$$

Taking the first-order term of a Taylor-Lagrange expansion:

$$J(\lambda) = f(a) + \lambda\left(\frac{\partial f}{\partial x}(a)\,b_x + \frac{\partial f}{\partial y}(a)\,b_y + \frac{\partial f}{\partial z}(a)\,b_z\right) = f(a) + \lambda\,\nabla f(a)\cdot b = 0$$

If $f(a) = 0$, a is located on the curve. Furthermore, let us assume that
$\nabla f(a)\cdot b = 0$; then the line L and the curve C meet at a in two coincident points.
A point is said to be regular if $\nabla f(a) \neq 0$. Otherwise it is a singular (or multiple) point.
When the point a is regular, the line L is said to be tangent to the curve C at a.
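The first-order test above is easy to try numerically. The following sketch is an illustration only: it estimates the gradient by central differences and classifies the line through a and b by the three cases of the text. The function names and the two example curves (a circle and a nodal cubic) are chosen here for the example and are not taken from the paper.

```python
def gradient(f, p, h=1e-6):
    """Central-difference estimate of the gradient of f at p = (x, y, z)."""
    grad = []
    for i in range(3):
        plus, minus = list(p), list(p)
        plus[i] += h
        minus[i] -= h
        grad.append((f(*plus) - f(*minus)) / (2.0 * h))
    return grad


def classify_line(f, a, b, tol=1e-6):
    """Describe how the line through a and b meets the curve f = 0 at the point a."""
    if abs(f(*a)) > tol:
        return "a is not on the curve"
    grad = gradient(f, a)
    if all(abs(g) <= tol for g in grad):
        return "a is a singular point"
    if abs(sum(g * bi for g, bi in zip(grad, b))) <= tol:
        return "tangent at a"
    return "simple intersection at a"


def circle(x, y, z):
    return x * x + y * y - z * z


def nodal_cubic(x, y, z):
    return y * y * z - x * x * (x + z)


print(classify_line(circle, (1.0, 0.0, 1.0), (1.0, 1.0, 1.0)))       # the line x = z
print(classify_line(circle, (1.0, 0.0, 1.0), (0.0, 0.0, 1.0)))       # the line y = 0
print(classify_line(nodal_cubic, (0.0, 0.0, 1.0), (1.0, 0.0, 1.0)))  # through the node
```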

Since the fundamental matrix is a mapping from the first image plane into 
the dual of the second image plane, which is the set of lines that lie on the second 
image, it will be useful to consider the following notion: 

Definition 2 Dual curve 

Given a planar algebraic curve C, the dual curve is defined in the dual plane
as the set of all lines tangent to C. The dual curve is algebraic and thus can be
described as the set of lines (u, v, w) that are the zeros of a form $\phi(u, v, w) = 0$.
If C is of order n, its dual curve D is of order at most $n(n - 1)$.

We will also need to consider the notion of inflexion point:

Definition 3 Inflexion point

An inflexion point a of a curve C is a simple point of it whose tangent intersects
the curve in at least three coincident points. This means that the third term
(the second-order term) of the Taylor-Lagrange expansion must vanish too.

It will be useful to compute the inflexion points. For this purpose we define
the Hessian curve H(C) of C, which is given by the determinantal equation:

$$\left|\frac{\partial^2 f}{\partial x_i \partial x_j}\right| = 0$$

It can be proven that the points where a curve C meets its Hessian
curve H(C) are exactly the inflexion points and the singular points. Since the
degree of H(C) is $3(n - 2)$, there are $3n(n - 2)$ inflexion and singular points,
counted with the corresponding intersection multiplicities (Bezout's theorem).

2.2 Introductory Properties 

In this section, we are interested in providing a few general properties of two 
images of the same planar algebraic curve. First, note that the condition that 
the plane of the curve in space does not pass through the camera centers is 
equivalent to the fact that the curves in the image planes do not collapse to 
lines and are projectively isomorphic to the curve in space. Furthermore, the 
homography matrix induced by the plane of the curve in space is regular. 

Proposition 1 Homography mapping 

Let $\alpha$ be the mapping from the first image to the second image that sends p to
Ap. Let $f(x, y, z) = 0$ (respectively $g(x, y, z) = 0$) be the equation of the curve C
(respectively C') in the first (respectively second) image. We have the following
constraint on the homography A:

$$\exists \lambda,\ \forall x, y, z,\quad g \circ \alpha(x, y, z) = \lambda f(x, y, z)$$

Proof: Since the curves C and C' correspond through the homography A, the
two irreducible polynomials $g \circ \alpha$ and f define the same curve C. Thus these
polynomials must be equal up to a scale factor (see previous subsection).



Proposition 2 Tangency conservation 

Let $J$ be the set of the epipolar lines in the first image that are tangent to
the curve C, and let $J'$ be the set of epipolar lines in the second image that are
tangent to the curve C'. The elements of $J$ and $J'$ are in correspondence through
the homography A induced by the plane of the curve in space.

Proof: Let f (respectively g) be the irreducible polynomial that defines C
(respectively C'). Let $\alpha$ be the mapping from the first image plane to the second image
plane that takes a point p and sends it to Ap. According to the previous
proposition, the two polynomials f and $g \circ \alpha$ are equal up to a scale $\mu$. Let e and e' be
the two epipoles. Let p be a point located on C. The line joining e and p is tangent
to C at p if $\lambda = 0$ is a double root of the equation $f(p + \lambda e) = 0$. (If e is located
on C, we invert p and e.) This is equivalent to saying that $\nabla f(p)\cdot e = 0$. Now
$\nabla g(p')\cdot e' = \nabla g(Ap)\cdot Ae = dg(\alpha(p)) \circ d\alpha(p).e = d(g \circ \alpha)(p).e = \mu\, df(p).e = \mu \nabla f(p)\cdot e = 0$.
Therefore it is equivalent to the tangency of the line $e' \wedge Ap$ with
C'. Given a line $l \in J$, its corresponding line $l' \in J'$ is given by: $A^T l' = l$.¹

¹ By duality $A^T$ sends the lines of the second image plane into the lines of the first
image plane. Here we have shown that $A^T$ induces a one-to-one correspondence
between $J'$ and $J$.

Note that since epipolar lines are transformed in the same way through any
homography, the two sets $J$ and $J'$ are in fact projectively related by any
homography. Some authors have already observed a similar property for apparent
contours.

Proposition 3 Inflexions and singularities conservation 

The inflexions (respectively the singularities) of the two image curves are pro- 
jectively related by the homography through the plane of the curve in space. 

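Proof: This double property is implied by the simple relations below (we use the same
notation as in the previous proposition, taking the scale factor of Proposition 1
equal to 1):

$$\begin{pmatrix} \frac{\partial f}{\partial x}(p) \\ \frac{\partial f}{\partial y}(p) \\ \frac{\partial f}{\partial z}(p) \end{pmatrix} = A^T \begin{pmatrix} \frac{\partial g}{\partial x}(Ap) \\ \frac{\partial g}{\partial y}(Ap) \\ \frac{\partial g}{\partial z}(Ap) \end{pmatrix}$$

$$\left[\frac{\partial^2 f}{\partial x_i \partial x_j}(p)\right] = A^T \left[\frac{\partial^2 g}{\partial x_i \partial x_j}(Ap)\right] A$$

The first relation implies the conservation of the singularities by homography,
whereas the second relation implies the conservation of the whole Hessian curve
by homography.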



3 Recovering the Epipolar Geometry from Curve 
Correspondences 

3.1 From Conic Correspondences

Let C (respectively C') be the full-rank (symmetric) matrix of the conic in
the first (respectively second) image. The equations of the dual curves are
$\phi(u, v, w) = l^T C^* l = 0$ and $\psi(u, v, w) = l^T C'^* l = 0$, where $l = [u, v, w]^T$,
$C^* = \det(C)\,C^{-1}$ and $C'^* = \det(C')\,C'^{-1}$. $C^*$ and $C'^*$ are the adjoint matrices
of C and C'.
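As a small numerical illustration of the dual conic, the sketch below computes the adjoint matrix of a 3x3 conic matrix and evaluates the dual form on a few lines. The unit circle and the test lines are chosen here for the example; the helper names are hypothetical.

```python
def adjugate3(m):
    """Adjugate of a 3x3 matrix; it equals det(m) * inverse(m) when m is invertible."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]


def dual_form(c_star, line):
    """Evaluate l^T C* l for the line l = (u, v, w)."""
    return sum(line[r] * c_star[r][s] * line[s] for r in range(3) for s in range(3))


# Unit circle x^2 + y^2 - z^2 = 0 written as a symmetric conic matrix.
C = [[1, 0, 0], [0, 1, 0], [0, 0, -1]]
C_star = adjugate3(C)

print(dual_form(C_star, (1, 0, -1)))  # line x = 1
print(dual_form(C_star, (1, 0, -2)))  # line x = 2
```

The dual form vanishes exactly on the tangent line x = 1 and not on the line x = 2, which misses the circle.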



Theorem 1 The fundamental matrix, the first epipole and the conic matrices
are linked by the following relation:

$$\exists \lambda \neq 0 \text{ such that: } F^T C'^* F = \lambda\,[e]_\times C^* [e]_\times, \qquad (1)$$

where $[e]_\times$ is the matrix that represents the linear map $p \mapsto e \wedge p$.

Proof: According to Proposition 2, both sides of the equation represent the two
tangents to the conic C passing through the epipole e. Each tangent appears at the first
order in both expressions. Therefore they are equal up to a non-zero scale factor.

It is worthwhile noting that these equations are identical to Kruppa's
equations, which were introduced in the context of self-calibration.

From equation (1) one can extract a set, denoted $E_\lambda$, of six equations on F,
e and an auxiliary unknown $\lambda$. By eliminating $\lambda$ it is possible to get five
bi-homogeneous equations on F and e.

Theorem 2 The six equations, $E_\lambda$, are algebraically independent.

Proof: Using the isomorphic mapping $(F, e, \lambda) \mapsto (D'^* F D^{-1}, De, \lambda) = (X, y, \lambda)$,
where $D = \sqrt{C}$ and $D'^* = \sqrt{C'^*}$, over the field of complex numbers, the
original equations are mapped into the upper triangle of $X^T X = \lambda [y]_\times^2$. Given this
simplified form, it is possible to compute a Groebner basis. Then we can
compute the dimension of the affine variety in the variables $(X, y, \lambda)$ defined
by these six equations. The dimension is 7, which shows that the equations are
algebraically independent.

Note that the equations $E_\lambda$ imply that $Fe = 0$ (one can easily deduce it from
equation (1)).² In order to count the number of matching conics, in generic
positions, that are necessary and sufficient to recover the epipolar geometry, we
eliminate $\lambda$ from $E_\lambda$ and get a set E that defines a variety V of dimension
7 in a 12-dimensional affine space, whose points are (F, e). The equations in
E are bi-homogeneous in F and e, and V can also be regarded as a variety of
dimension 5 in the bi-projective space $P^8 \times P^2$, where (F, e) lie. Now we
project V into $P^8$: by eliminating e from the equations, we get a new variety
$V_f$ which is still of dimension 5 and which is contained in the variety defined
by det(F) = 0, whose dimension is 7.³ Therefore two pairs of matching conics
in generic positions define two varieties isomorphic to $V_f$ which intersect in a
three-dimensional variety (5 + 5 - 7 = 3). A third conic in generic position will
reduce the intersection to a one-dimensional variety (5 + 3 - 7 = 1). A fourth
conic will reduce the system to a zero-dimensional variety. These results can be
compiled into the following theorem:

Theorem 3 {Four conics} or {three conics and a point} or {one conic and five
points} in generic positions are sufficient to compute the epipolar geometry.
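The counting argument behind Theorem 3 can be tabulated with a few lines of code. This is a bookkeeping sketch only: it uses the codimension of 2 per conic pair derived above inside the 7-dimensional variety det(F) = 0, and it additionally assumes the standard fact that each point match contributes one linear (epipolar) constraint. The function name is hypothetical.

```python
def residual_dimension(conics: int, points: int) -> int:
    """Dimension left for F after the given generic matches (0 means finitely many solutions)."""
    ambient = 7                      # dimension of the variety det(F) = 0 in P^8
    codimension = 2 * conics + points
    # Over-determined generic configurations are also reported as 0.
    return max(ambient - codimension, 0)


for conics, points in [(2, 0), (3, 0), (4, 0), (3, 1), (1, 5)]:
    print(conics, points, residual_dimension(conics, points))
```

The first two rows reproduce the intermediate dimensions 3 and 1 quoted in the text, and the last three rows are the configurations listed in Theorem 3.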

We conclude this section by noting that this dimensional result is valid
under the assumption of complex varieties. Since we are interested in real
solutions only, degeneracies might occur in very special cases, such that fewer than
four conics might be sufficient to recover the epipolar geometry.

3.2 From Higher Order Curve Correspondences 

Assume we have a projection of an algebraic curve of order $n \geq 3$. We will show
next that a single matching pair of curves is sufficient for uniquely recovering
the homography matrix induced by the plane of the curve in space, whereas two
pairs of matching curves (residing on distinct planes) are sufficient for recovering
the fundamental matrix.

² It is clear that we have: $F^T C'^* F e = 0$. For any matrix M, we have:
$\ker(M^T) = \mathrm{Im}(M)^\perp$. In addition, C' is invertible. Hence $Fe = 0$.
³ Since it must be contained into the projection to $P^8$ of the hypersurface defined by
det(Fe) = 0.

Let (respectively be the tensor form of the first (second) 

image curve. Let A* be the tensor form of the homography matrix. 



3A ^ 0, such as: 



( 2 ) 



Since a planar algebraic curve of order n is represented by a polynomial
containing $\frac{1}{2}n(n+3)+1$ terms, we are provided with $\frac{1}{2}n(n+3)$ equations (after
elimination of $\lambda$) on the entries of the homography matrix. Let S denote this
system. Therefore two curves of order $n \geq 3$ are in principle sufficient to recover
the epipolar geometry. However, we show a more geometric and more convenient
way to extract the homography matrix, since the system S might be very difficult
to solve.

The simpler algorithm is valid for non-oversingular curves, i.e. when a
technical condition about the singularities of the curve holds. In order to make this
condition explicit, we define a node to be an ordinary double point, that is, a
double point with two distinct tangents, and a cusp to be a double point with
coincident tangents. A curve of order n whose only singular points are either
nodes or cusps satisfies Plücker's formula:

$$3n(n - 2) = i + 6\delta + 8\kappa,$$

where i is the number of inflexion points, $\delta$ is the number of nodes, and $\kappa$ is the
number of cusps. For our purpose, a curve is said to be non-oversingular when
its only singularities are nodes and cusps and when $i + s \geq 4$, where s is the
number of all singular points.
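Plücker's formula and the non-oversingularity condition can be checked on the three classical types of irreducible cubic. The sketch below uses the standard complex counts for these curves (9 inflexions for a smooth cubic, 3 inflexions and one node for a nodal cubic, 1 inflexion and one cusp for a cuspidal cubic), which are supplied here as an illustration and are not stated in the text; the function names are hypothetical.

```python
def pluecker_holds(n: int, inflexions: int, nodes: int, cusps: int) -> bool:
    """Check 3n(n - 2) = i + 6*delta + 8*kappa for a curve with only nodes and cusps."""
    return 3 * n * (n - 2) == inflexions + 6 * nodes + 8 * cusps


def is_non_oversingular(inflexions: int, nodes: int, cusps: int) -> bool:
    """At least four inflexion or singular points are available (i + s >= 4)."""
    return inflexions + nodes + cusps >= 4


cubics = {
    "smooth cubic": (9, 0, 0),
    "nodal cubic": (3, 1, 0),
    "cuspidal cubic": (1, 0, 1),
}
for name, (i, delta, kappa) in cubics.items():
    print(name, pluecker_holds(3, i, delta, kappa), is_non_oversingular(i, delta, kappa))
```

Under these counts the cuspidal cubic satisfies the formula but offers only two distinguished points, so it falls outside the non-oversingular case.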

Since the inflexion and singular points in both images are projectively related
through the homography matrix (Proposition 3), one can compute the homography
through the plane of the curve in space of a curve of order $n \geq 3$, provided
the previous condition holds. The resulting algorithm is as follows:

1. Compute the Hessian curves in both images.

2. Compute the intersection of the curve with its Hessian in both images. The
output is the set of inflexion and singular points.

3. Discriminate between inflexion and singular points by the additional
constraint for each singular point a: $\nabla f(a) = 0$.

At first sight, there are $i! \times s!$ possible correspondences between the sets
of inflexion and singular points in the two images. But it is possible to further
reduce the combinatorics by separating the points into two categories. The points
are normalized such that the last coordinate is 1 or 0. Then real points are
separated from complex points. Each category of the first image must be matched
with the same category in the second image. The right solution can then be
selected as the one that makes the system S the closest to zero, or the one that
minimizes the Hausdorff distance between the set of points from the
second image curve and the reprojection of the set of points from the first image
curve into the second image. For better results, one can compute the Hausdorff
distance on inflexion and singular points separately, within each category. We
summarize this result:

Theorem 4 The projections of a single planar algebraic curve of order $n \geq 3$ are
sufficient for a unique solution for the homography matrix induced by the plane
of the curve. The projections of two such curves, residing on distinct planes, are
sufficient for a unique solution to the multi-view tensors (in particular to the
fundamental matrix).
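The matching step described above can be sketched as a brute-force search over correspondences, scoring each candidate homography by the Hausdorff distance. The code below is a minimal illustration under several assumptions that are not part of the paper: points are real and given in inhomogeneous coordinates, the first four points of the first set are in general position, the homography is normalized so that its last entry is 1, and all function names are hypothetical.

```python
import math
from itertools import permutations


def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography_from_4(src, dst):
    """Homography with last entry fixed to 1 that maps src[k] to dst[k], k = 0..3."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        rhs.append(v)
    h = solve_linear(rows, rhs)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_h(h, p):
    """Apply the homography h to the inhomogeneous point p."""
    x, y = p
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def hausdorff(ps, qs):
    """Symmetric Hausdorff distance between two finite point sets."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    forward = max(min(dist(p, q) for q in qs) for p in ps)
    backward = max(min(dist(p, q) for p in ps) for q in qs)
    return max(forward, backward)


def best_homography(pts1, pts2):
    """Try every assignment of four points and keep the lowest Hausdorff score."""
    best = (float("inf"), None)
    for candidate in permutations(pts2, 4):
        try:
            h = homography_from_4(pts1[:4], candidate)
            score = hausdorff([apply_h(h, p) for p in pts1], pts2)
        except (ValueError, ArithmeticError):
            continue  # singular system or point sent to infinity
        if score < best[0]:
            best = (score, h)
    return best


H_TRUE = [[1.0, 0.2, 3.0], [-0.1, 0.9, 1.0], [0.001, 0.002, 1.0]]
pts1 = [(0.0, 0.0), (100.0, 0.0), (100.0, 80.0), (0.0, 80.0), (40.0, 30.0), (70.0, 55.0)]
pts2 = [apply_h(H_TRUE, p) for p in reversed(pts1)]  # same points, unknown ordering
score, H_est = best_homography(pts1, pts2)
print(score)
```

In practice the search would be run within each category of points (inflexion or singular, real or complex) as the text suggests, which keeps the number of candidate assignments small.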

It is worth noting that the reason why the fundamental matrix can be
recovered from two pairs of curve matches is simply that two
homography matrices provide a sufficient set of linear equations for the
fundamental matrix: if $A_i$, $i = 1, 2$, are two homography matrices induced by planes
$\pi_1, \pi_2$, then $A_i^T F + F^T A_i = 0$ because $A_i^T F$ is a skew-symmetric matrix.
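The skew-symmetry property can be verified numerically. The sketch below relies on the standard relation $F = [e']_\times A$ for a homography A induced by a plane, which is assumed here rather than quoted from the text; the matrix A and the epipole e' are arbitrary illustrative values, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def transpose(a):
    return [list(row) for row in zip(*a)]


def cross_matrix(e):
    """Matrix of the linear map p -> e ^ p (cross product with e)."""
    x, y, z = e
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]


A = [[1.0, 0.2, 3.0], [-0.1, 0.9, 1.0], [0.001, 0.002, 1.0]]  # illustrative homography
e_prime = [0.3, -0.5, 1.0]                                     # illustrative second epipole
F = matmul(cross_matrix(e_prime), A)

S = matmul(transpose(A), F)  # expected to be skew-symmetric
residual = max(abs(S[i][j] + S[j][i]) for i in range(3) for j in range(3))
print(residual)
```

Since $A^T F + F^T A$ is symmetric, each homography contributes at most six independent linear equations on the nine entries of F, so two homographies from distinct planes give enough equations to determine F up to scale.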

In the previous section, the computation of the epipolar geometry was carried
out using an equivalent of Kruppa's equations for any conic. It is of theoretical
interest to investigate the question of a possible generalization of Kruppa's
equations to higher-order curves. To this end, let $\phi$ (respectively $\psi$) be the dual curve
in the first (respectively second) image. Let $\gamma$ (respectively $\xi$) be the mapping
sending a point p from the first image into its epipolar line $e \wedge p$ (respectively Fp)
in the first (respectively second) image. Then Theorem 1 holds in the general
case, and can be regarded as an extended version of Kruppa's equation:

Theorem 5 The dual curves in both images are linked by the following
expression:

$$\exists \lambda \neq 0 \text{ such that: } \psi \circ \xi = \lambda\, \phi \circ \gamma \qquad (3)$$

Proof: According to their geometric interpretation, the sets defined by each
side of this equation are identical: each is the set of tangents to the first
image curve passing through the first epipole. It is left to show that each
tangent appears with the same multiplicity in each representation. This is
easily checked by a short computation, where A is the homography matrix
between the two images through the plane of the curve in space and $\alpha(p) = Ap$:
$\psi \circ \xi(p) = \psi(e' \wedge Ap) = \psi(\alpha(e) \wedge \alpha(p)) = \psi \circ ({}^t\alpha)^{-1}(\gamma(p))$.⁴ Then it is sufficient
to see that the dual formulation of Proposition 1 is written $\psi \circ ({}^t\alpha)^{-1} = \phi$.



4 3D Reconstruction 

We turn our attention to the problem of reconstructing a planar algebraic curve 
from two views. Let the camera projection matrices be [I; 0] and [H;e']. We 
propose two simple algorithms. 



⁴ Indeed, for a regular matrix A: $Ax \wedge Ay = \det(A)\,A^{-T}(x \wedge y)$. Then, since $\psi$ is a
form, the last equality is true up to a scale factor.

4.1 Algebraic Approach



In this approach we first recover the homography matrix induced by the plane of
the curve in space. This reduces the problem to finding the roots of univariate
polynomials. The approach is inspired by a known technique for recovering the
homography matrix. It is known that any homography can be written as
$A = H + e' a^T$. Using equation (2) we get the following relation:






p'U 






where Ae = /re'. 

Note that in fact this equation can be also obtained in a pure geometric way, 
by saying that the polar curve with respect to the epipole is conser- 
ved by homography. Let and L 

Therefore we get: -|- e''‘^aj) = Thus the vector a can be ex- 
pressed as a function of ?/ = „^_i , (provided the epipoles are not located on 

the image curves): aj = ~ where /3 = j^e'*L...e'*'‘. Then 

Aj = W- + j^(r]Aje'^ — A'j,H^e''‘). Substituting this expression of A into the 
equation El and eliminating A leads to a set of equations of degree n on ry. We 
are looking for the real common solutions. 

In the conic case, there will in general be two distinct real solutions for $\eta$,
corresponding to the two planar curves that might have produced the images. For
higher-order curves, the situation may be more complicated.



4.2 Geometric Approach 

The following approach highlights the geometric meaning of the reconstruction
problem. The reconstruction is done in three steps: 

1. Compute the cones generated by the camera centers and the image curves, 
whose equations are denoted F and G. 

2. Compute the plane of the curve in space. 

3. Compute the intersection of the plane with one of the cones. 

The three steps are detailed below. 

Computing the cones equations. 

For a general camera, let $M$ be the camera matrix. Let $\tau$ be the projection
mapping from 3D to 2D: $\tau(P) = MP$. Let $f(p) = 0$ be the equation of the image
curve. Since a point of the cone is characterized by the fact that $f(\tau(P)) = 0$,
the cone equation is simply $f(\tau(P)) = 0$. Here we have $F(P) = f([I; 0]P)$ and
$G(P) = f([H; e']P)$, where in each case $f$ denotes the curve of the corresponding image.
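The construction of the cone equation can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: the image curve (a unit circle) and the helper names `project`, `image_conic` and `cone` are assumptions made for illustration only.

```python
def project(M, P):
    """Apply a 3x4 camera matrix M to a homogeneous 3D point P."""
    return [sum(M[r][c] * P[c] for c in range(4)) for r in range(3)]


def image_conic(p):
    """Hypothetical image curve f(p) = x^2 + y^2 - z^2 (a unit circle)."""
    x, y, z = p
    return x * x + y * y - z * z


def cone(M, f):
    """Return the cone equation F(P) = f(M P) as a Python function."""
    return lambda P: f(project(M, P))


M1 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]  # first camera [I; 0]
F = cone(M1, image_conic)

# Every 3D point on the ray through the camera centre and an image point
# of the curve satisfies the cone equation (up to rounding).
image_point = (0.6, 0.8, 1.0)  # lies on x^2 + y^2 = z^2
for depth in (0.5, 2.0, 7.0):
    P = [depth * c for c in image_point] + [1.0]
    print(depth, F(P))
```

A second cone is obtained in the same way by passing the second camera matrix and the second image curve to `cone`.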

Computing the plane of the curve in space. 
Theorem 6 The plane equation $\pi(P) = 0$ satisfies the following constraint:
there exists a scalar $k$ and a polynomial $r$ such that $r \cdot \pi = F + kG$.



Proof: $F$ and $G$ can be regarded as regular functions on the plane. Since they are
irreducible polynomials and vanish on the plane on the same irreducible curve
and nowhere else, they must be equal up to a scalar in the coordinate ring of
the plane, i.e., they are equal up to a scalar modulo $\pi$.



Let $\pi(P) = \alpha X + \beta Y + \gamma Z + \delta T$, where $P = [X, Y, Z, T]^T$. The theorem
provides $\binom{3 + n}{n}$ (which is the number of terms in a polynomial that defines
a surface of order $n$ in the three-dimensional projective space) equations on
$k, \alpha, \beta, \gamma, \delta, (r_i)_{1 \le i \le s}$, where $s = \binom{3 + n - 1}{n - 1}$ and the $(r_i)_i$ are the coefficients
of $r$. One can eliminate the auxiliary unknowns $k$, $(r_i)$, using Groebner bases
or resultant systems. Therefore we get $\frac{1}{2} n(n + 3)$ equations on
$\alpha, \beta, \gamma, \delta$.
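The equation count above can be checked with a short computation. This is only a counting sketch under the assumption that each eliminated unknown ($k$ and the $s$ coefficients of $r$) removes one equation; the function names are hypothetical.

```python
from math import comb


def surface_terms(n):
    """Number of monomials of degree n in 4 homogeneous variables."""
    return comb(n + 3, 3)


def remaining_equations(n):
    """Equations left after eliminating k and the coefficients of r."""
    s = surface_terms(n - 1)          # r has degree n - 1
    return surface_terms(n) - s - 1   # the extra 1 accounts for k


for n in range(2, 7):
    print(n, surface_terms(n), remaining_equations(n), n * (n + 3) // 2)
```

The last two printed columns agree, which matches the stated count of $\frac{1}{2} n(n + 3)$ equations.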



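However, a more explicit way to perform this elimination follows. Let $S$ be
the surface whose equation is $\Sigma = F + kG = 0$. The points $P$ that lie on
the plane $\pi$ are characterized by the fact that, when regarded as points of $S$,
their tangent planes are exactly $\pi$. This is expressed by the following system of
equations:

$\pi(P) = 0$
$\left(\beta \frac{\partial \Sigma}{\partial X} - \alpha \frac{\partial \Sigma}{\partial Y}\right)(P) = 0$
$\left(\gamma \frac{\partial \Sigma}{\partial X} - \alpha \frac{\partial \Sigma}{\partial Z}\right)(P) = 0$
$\left(\delta \frac{\partial \Sigma}{\partial X} - \alpha \frac{\partial \Sigma}{\partial T}\right)(P) = 0$

On the other hand, on the plane $\Sigma(P) = F(P) + kG(P) = 0$. Therefore
$k = -F(P)/G(P)$ for any $P$ on the plane that is not located on the curve itself.
Therefore we get the following system:

$\pi(P) = 0$
$\left(\beta \left(G \frac{\partial F}{\partial X} - F \frac{\partial G}{\partial X}\right) - \alpha \left(G \frac{\partial F}{\partial Y} - F \frac{\partial G}{\partial Y}\right)\right)(P) = 0$
$\left(\gamma \left(G \frac{\partial F}{\partial X} - F \frac{\partial G}{\partial X}\right) - \alpha \left(G \frac{\partial F}{\partial Z} - F \frac{\partial G}{\partial Z}\right)\right)(P) = 0$
$\left(\delta \left(G \frac{\partial F}{\partial X} - F \frac{\partial G}{\partial X}\right) - \alpha \left(G \frac{\partial F}{\partial T} - F \frac{\partial G}{\partial T}\right)\right)(P) = 0$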

Since the plane we are looking for does not pass through the point $[0, 0, 0, 1]^T$,
which is the first camera center, $\delta$ can be normalized to 1. Thus for a point $P$ on
the plane, we have $T = -(\alpha X + \beta Y + \gamma Z)$. By substituting this expression of
$T$ into the previous system, we get a new system that vanishes over all values of
$(X, Y, Z)$. Therefore its coefficients must be zero. This provides us with a large
set of equations on $(\alpha, \beta, \gamma)$ that can be used to refine the solution obtained by
the algebraic approach.

Computing the intersection of the plane and one of the cones. 
The equation of the curve on the plane $\pi$ is given by the elimination of $T$
between the two equations $\alpha X + \beta Y + \gamma Z + T = 0$ and $f(\tau(P)) = 0$. Using
the first cone gives us the equation immediately, since the first camera matrix
is $[I; 0]$.
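Because the first camera is $[I; 0]$, a point of the first image curve can be lifted directly onto the plane. The sketch below assumes $\delta$ has been normalized to 1; the plane parameters and the helper names are hypothetical values chosen for illustration.

```python
def lift_to_plane(p, plane):
    """Lift an image point p = (x, y, z) of the first view onto the plane
    alpha*X + beta*Y + gamma*Z + T = 0 (delta normalised to 1)."""
    alpha, beta, gamma = plane
    x, y, z = p
    return (x, y, z, -(alpha * x + beta * y + gamma * z))


def plane_value(P, plane):
    """Evaluate alpha*X + beta*Y + gamma*Z + T at a homogeneous point P."""
    alpha, beta, gamma = plane
    X, Y, Z, T = P
    return alpha * X + beta * Y + gamma * Z + T


plane = (0.2, -0.1, 0.5)   # hypothetical (alpha, beta, gamma)
p = (0.6, 0.8, 1.0)        # a point of the first image curve
P = lift_to_plane(p, plane)
print(P, plane_value(P, plane))
```

The first three coordinates of the lifted point are the image point itself, so the curve on the plane inherits the equation of the first image curve.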



5 Experiments 



5.1 Computing the Epipolar Geometry: 3 Conics and 2 Points 



In order to demonstrate the validity of the theoretical analysis, we compute the
fundamental matrix from 3 conics and 2 points in a synthetic experiment. The
computation is too intensive for the standard computer algebra packages. We have
found that Fast Gb, a powerful program for Groebner basis computation introduced by
J.C. Faugère, is one of the few packages that can handle this kind of
computation. The conics in the first image are:
fl{x,y,z) = x^ + y'^ + 
gUx, y,z) = + y'^ + 81 

hl{x, y,z) = {A:X -\- y) X + {x — 1/2 z)y -\- [—l/2y z) z 
The conics in the second image are: 

f2{x,y, z) = - ioo(_4^^)2 (-1900a;^ -k SOOa;^^ - 130 V + + 9820t/z^- 

immyz - 72700z^ + 40000z^V3) 

V, «) = 400(44891400^)^ (33036473600a:" + 5732960000o:2y3 + 332999600*2/73- 
214463200*1/ - 73852000*2 - 1384952000*273 -k 9091399981?/^ -k 1771266080?/^ 73- 
16090386780?/273 -k 10160177600?/2 -k 5564962423002^ + 1415825920002^73) 
h2(x,y,z) = - 400 (- 56 i+ 38 Y 3)2 (-519504000*^ + 483117002^- 
125749120*?/73 -k 43249920*?/ - 254646400*273 - 6553140?/273 + 56456040?/2-k 
68848000*^75 -k 1279651200*2 - 2722674002^75 -k 2522418775 - 2982097) 

Given just the constraints deduced from the conics, the system defines, as 
expected, a one-dimensional variety in P® x P^. When just one point is intro- 
duced, we get a zero-dimensional variety, whose degree is 516. When two points 
are introduced, the system reduces to the following: 

$F[1,1] = F[2,2] = F[2,3] = F[3,2] = F[3,3] = 0$
$F[3,1] + (\sqrt{3} - 1)\,F[1,3] = 0$
$10\,F[2,1] + (\sqrt{3} - 1)\,F[1,3] = 0$
$10\,F[1,2] + (\sqrt{3} - 2)\,F[1,3] = 0$
133813 * F[l, 3]2 - 20600 * - 51100 = 0 



Then it is easy to get the right answer for the fundamental matrix: 



0 



—2+V3 1 

7511-206 ^3 7511-206 ^3 



-l-k\/3 Q 

7511 - 206^3 

-10 . 0 

7511-206 ^3 



0 

0 



https://fgb.medicis.polytechnique.fr/
5.2 Computing the Homography Matrix

We have performed a real-image test on recovering the homography matrix
induced by the plane of a third-order curve. The equations of the image curves
were recovered by least-squares fitting. Once the homography was recovered, we
used it to map the curve in one image onto its matching curve in the other
image and measured the geometric distance (error). The error is at the subpixel
level, which is a good indication of the practical value of our approach.
Figures 1 and 2 display the results.




Fig. 1. The first and the second image cubic. 




Fig. 2. The reprojected curve is overlaid on the second image cubic. A zoom shows
the very slight difference.
5.3 3D Reconstruction

Given two images of the same curve of order 4 (figure 3) and the epipolar
geometry, we start by computing the plane and the homography matrix, using the
algebraic approach to reconstruction. There are three solutions, all of which are
very robust. However, to get further precision, one can refine them with the final
system on the plane parameters, obtained at the end of the geometric approach. To
demonstrate the accuracy of the algorithm, the reprojection of the curve in the
second image is shown in figure 4. The 3D rendering of the correct solution
and the three solutions plotted together are shown in figure 5.




Fig. 3. The curves of order 4 as an input of the reconstruction algorithm. 




Fig. 4. Reprojection of the curve onto the second image. 
Fig. 5. The curves of order 4 as an input of the reconstruction algorithm.



Finally, the equation of the correct solution on its plane is given by: f{x, y, z) = 

9006922504387547 4 _ 4947731105035649 3 , 1070847909255857 22 _ 

9007199254740992 ^ 1152921504606846976 147573952589676412928 ^ ^ 

5458927196207623 , 3 , 3969428158337415 4 _ 7563069091264439 ^ 3 , 

1208925819614629174706176 V ^ 2475880078570760549798248448 ^ 1152921504606846976 

5911661048544087 2 7447102119819593 j 3625625302714855 i 

295147905179352825856 302231454903657293676544 ^“'"618970019642690137449562112 

4936178943362411 2 2 8944822903795571 7158022235457567 2 2 

295147905179352825856 ^ 302231454903657293676544 ^'^"''309485009821345068724781056 ^ 

6146225343803339 3, 7423176283805271 6539339092801811 4 

302231454903657293676544 "''618970019642690137449562112 ^"''618970019642690137449562112 

The curve is drawn on figure 6. 




Fig. 6. The original curve. 



6 Conclusion and Future Work 



We have presented simple closed-form solutions for the recovery of the homography
matrix from a single matching pair of curves of order $n \ge 3$ arising from a planar
curve; two algorithms for reconstructing algebraic curves from their projections,
again in closed form; and revisited the problem of recovering the fundamental
matrix from matching pairs of conics, proposing an analytic proof of the
previously reported finding that four matching pairs are necessary for a unique solution.

Our experiments on real imagery demonstrate a sub-pixel performance level,
which is evidence of the practical value of our algorithms. Future work will
investigate the same fundamental questions, calibration and reconstruction, from
general three-dimensional curves.

Acknowledgment We express our gratitude to Jean-Charles Faugère for
giving us access to his powerful system FGb.
7 Appendix

7.1 Tensor Representation of a Planar Algebraic Curve 

As a conic admits a matrix representation, $p^T C p = 0$ iff $p$ belongs to the conic, a
general algebraic curve of order $n$ admits a tensor representation:
$T_{i_1 \dots i_n} p^{i_1} \dots p^{i_n} = 0$ iff $p$ belongs to the curve, where for each $k$,
$i_k \in \{1, 2, 3\}$. In this tensor representation, a short notation is used: an index
repeated in low and high position is summed over its domain of definition. One has to
link this tensor representation with the regular polynomial representation:
$f(x, y, z) = 0$.
Lemma 1. Let $f(x, y, z) = 0$ be a homogeneous equation of order $n$. There
exists a tensor of order $n$, defined up to a scale factor, such that the equation can
be rewritten in the following form:

$T_{i_1 \dots i_n} p^{i_1} \dots p^{i_n} = 0$,

where for each $k$, $i_k \in \{1, 2, 3\}$, $p = [x, y, z]^T$, and:

$T_{i_{\tau(1)} \dots i_{\tau(n)}} = T_{i_1 \dots i_n}$ for each $\tau$ which is a transposition of $\{1, 2, \dots, n\}$.

Proof: The proof is straightforward. It is only necessary to remark that for each
$n$-tuple $i_1, \dots, i_n$ such that $i_1 \ge i_2 \ge \dots \ge i_n$, we have
$n!\, T_{i_1 \dots i_n} = \frac{\partial^n}{\partial x^a \partial y^b \partial z^c} f(x, y, z)$,
where $a = \sum_{i_k = 1} 1$, $b = \sum_{i_k = 2} 1$, $c = \sum_{i_k = 3} 1$. The factor $n!$
is due to the symmetry of the tensor. Finally,
$T_{i_1 \dots i_n} = \frac{1}{n!} \frac{\partial^n}{\partial x^a \partial y^b \partial z^c} f(x, y, z)$.

^ A transposition of $\{1, 2, \dots, n\}$ is defined by the choice of a pair $\{i, j\} \subset \{1, 2, \dots, n\}$
such that $i \neq j$, with $\tau(k) = k$ for each $k \in \{1, 2, \dots, n\} \setminus \{i, j\}$, $\tau(i) = j$ and $\tau(j) = i$.
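The correspondence between the polynomial and its symmetric tensor can be checked numerically. The following is a minimal sketch under stated assumptions: indices run over 0, 1, 2 instead of 1, 2, 3, the cubic is a hypothetical example, and the function names are not from the paper.

```python
from itertools import product
from math import factorial


def symmetric_tensor(coeffs, n):
    """Build the symmetric tensor of a degree-n form.

    coeffs maps exponent triples (a, b, c), a + b + c = n, to the
    coefficient of x^a y^b z^c. Each entry equals the coefficient
    scaled by a! b! c! / n!, i.e. (1/n!) times the n-th partial derivative.
    """
    T = {}
    for idx in product(range(3), repeat=n):
        a, b, c = idx.count(0), idx.count(1), idx.count(2)
        scale = factorial(a) * factorial(b) * factorial(c) / factorial(n)
        T[idx] = coeffs.get((a, b, c), 0.0) * scale
    return T


def contract(T, p):
    """Full contraction T_{i1..in} p^{i1} ... p^{in}."""
    total = 0.0
    for idx, value in T.items():
        term = value
        for i in idx:
            term *= p[i]
        total += term
    return total


def evaluate(coeffs, p):
    """Evaluate the polynomial directly from its coefficients."""
    x, y, z = p
    return sum(c * x**a * y**b * z**e for (a, b, e), c in coeffs.items())


# Hypothetical cubic: f = x^3 + 2 x y z - y^2 z
cubic = {(3, 0, 0): 1.0, (1, 1, 1): 2.0, (0, 2, 1): -1.0}
T = symmetric_tensor(cubic, 3)
p = (0.5, -1.5, 2.0)
print(contract(T, p), evaluate(cubic, p))
```

Both printed values agree up to rounding, and the entries of `T` are invariant under any permutation of their indices.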
Abstract. Condensation is a popular algorithm for sequential inference
that resamples a sampled representation of the posterior. The algorithm
is known to be asymptotically correct as the number of samples
tends to infinity. However, the resampling phase involves a loss of infor- 
mation. The sequence of representations produced by the algorithm is a 
Markov chain, which is usually inhomogeneous. We show simple discrete 
examples where this chain is homogeneous and has absorbing states. In 
these examples, the representation moves to one of these states in time 
apparently linear in the number of samples and remains there. This phe- 
nomenon appears in the continuous case as well, where the algorithm 
tends to produce “clumpy” representations. In practice, this means that 
different runs of a tracker on the same data can give very different an- 
swers, while a particular run of the tracker will look stable. Furthermore, 
the state of the tracker can collapse to a single peak — which has non-zero 
probability of being the wrong peak — within time linear in the number 
of samples, and the tracker can appear to be following tight peaks in the 
posterior even in the absence of any meaningful measurement. This me- 
ans that, if theoretical lower bounds on the number of samples are not 
available, experiments must be very carefully designed to avoid these ef- 
fects. 

1 Introduction 

The Bayesian philosophy is that all information about a model is captured by a 
posterior distribution obtained using Bayes’ rule: 

posterior = $P(\text{world} \mid \text{observations}) \propto P(\text{observations} \mid \text{world})\, P(\text{world})$

where the prior P( world) is the probability density of the state of the world in 
the absence of observations. Many examples suggest that, when computational 
difficulties can be sidestepped, the Bayesian philosophy leads to excellent and 
effective use of data. The technique has been widely used in vision.

Obtaining some representation of the world from a posterior is often referred 
to as inference. One inference technique that is quite general is to represent the 
posterior by drawing a large number of samples from that distribution. These 
samples can then be used to estimate any expectation with respect to that 
posterior. 
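As a small illustration of estimating expectations from samples, the sketch below uses a hypothetical one-dimensional Gaussian posterior (mean 2, standard deviation 1); the function name `expectation` is an assumption made for this example.

```python
import random


def expectation(samples, g):
    """Estimate E[g(x)] from equally weighted samples of the posterior."""
    return sum(g(x) for x in samples) / len(samples)


rng = random.Random(0)
# Hypothetical posterior: Gaussian with mean 2 and standard deviation 1.
samples = [rng.gauss(2.0, 1.0) for _ in range(20000)]

mean = expectation(samples, lambda x: x)
second_moment = expectation(samples, lambda x: x * x)
variance = second_moment - mean ** 2
print(mean, variance)
```

The same sample set serves for any function `g`, which is what makes a sampled representation convenient.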
1.1 Condensation or “Survival of the Fittest” or Particle Filtering

The most substantial impact of sampling algorithms in vision has been the use
of resampling algorithms in tracking. The best known algorithm is known as
condensation in the vision community, survival of the fittest in the AI
community, and particle filtering in the control literature. In Condensation,
one has a prior density $p(x)$ on the state of the system, a process density
$p(x_t | x_{t-1})$, and an observation density $p(z | x)$. The state density at time $t$
conditioned on the observations till then, $p(x_t | Z_t)$, is represented by a set of
weighted samples $\{(s_t^{(n)}, \pi_t^{(n)}),\ n = 1, \dots, N\}$. To update this representation for
time $t + 1$, one draws $N$ points from the samples $s_t^{(n)}$, with replacement, with
probability proportional to their weights $\pi_t^{(n)}$. These points are perturbed in
accordance with the process density to get the new samples $s_{t+1}^{(n)}$. The new weights
$\pi_{t+1}^{(n)}$ are computed by evaluating $p(z | x)$ for each of the new samples $x$ in light of
the new observation $z$. Condensation is fast and efficient, and is now quite
widely applied (INSPEC produces 16 hits for the combination “condensation”
and “computer vision” for the last 4 years).
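The update described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `condensation_step`, the random-walk process density and the Gaussian observation density are all assumptions chosen for the example.

```python
import math
import random


def condensation_step(samples, weights, perturb, likelihood, z, rng):
    """One resample / perturb / reweight iteration.

    perturb(x, rng) draws from the process density p(x_{t+1} | x_t);
    likelihood(z, x) evaluates the observation density p(z | x).
    """
    n = len(samples)
    # Resample N points with replacement, proportional to the weights.
    resampled = rng.choices(samples, weights=weights, k=n)
    # Perturb each point according to the process density.
    new_samples = [perturb(x, rng) for x in resampled]
    # Reweight with the new observation and normalise.
    raw = [likelihood(z, x) for x in new_samples]
    total = sum(raw)
    return new_samples, [w / total for w in raw]


rng = random.Random(1)
samples = [rng.uniform(-5.0, 5.0) for _ in range(200)]
weights = [1.0 / 200] * 200

drift = lambda x, r: x + r.gauss(0.0, 0.1)                 # hypothetical process density
gauss_lik = lambda z, x: math.exp(-0.5 * (z - x) ** 2)     # hypothetical observation density

samples, weights = condensation_step(samples, weights, drift, gauss_lik, 1.0, rng)
print(len(samples), sum(weights))
```

The number of samples stays fixed at every step, which is why the cost per frame is bounded.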

Condensation can represent multi-modal distributions with its weighted 
samples, and can hence maintain multiple hypotheses when there is clutter or 
when there are other objects mimicking the target object. It can run with boun- 
ded computational resources in near real-time by maintaining the same number 
N of samples at each step. 

There are asymptotic correctness results for Condensation essentially as- 
serting that, for a fixed number of frames T and desired precision, there is a 
number N of samples so that the sampled representation at time t approximates 
the true density at time t to within the desired precision for t = 1, 2, . . . ,T—1,T 
(e.g. P3). There is little information on how large N should b^ in P, examples 
are given of 800 frames tracked with 100 samples (p. 18) and 500 frames tracked 
with 1500 samples (p. 19). 

In what follows, we show that iterations of the Condensation algorithm 
form a Markov chain, whose state space is quantized representations of a density. 
We show strong evidence that this Markov chain has some unpleasant properties. 
The process of resampling tends to make samples collapse to a single cluster, 
putting substantial weight on “peaky” representations. When the true density 
is multimodal, even if the mean computed from this clumpy density is unbiased, 
the movement of the clump may be slow. In turn, this means that: 

— expectations computed with the representation maintained by Condensa- 
tion have high variance so that different runs of the tracker can lead to very 
different answers ^section 1,8.11 section ; 

^ “Note that convergence has not been proved to be uniform in t. For a given fixed
t, there is convergence as $N \to \infty$ but nothing is said about the limit $t \to \infty$. In
practice this could mean that at later times t larger values of N may be required, 
though that could depend also on other factors such as the nature of the dynamical 
model.” P, p. 27 
— expectations computed with the representation maintained by a particular
instance of Condensation have low variance so that in a particular run of 
the tracker, it will look stable (sectioned! section R~TTl : 

— the state of the tracker may collapse to a single peak within time roughly 
linear in the number of samples (section 12.1 1 section 12. It section f2.2t sec- 
tion 

— the peak to which the state collapses may bear no relationship to the dynamic 
model (sections 12] and 

— and the tracker will appear to be following tight peaks in the posterior even 
in the absence of any meaningful measurement ('section I.S.2II . 

Some of these phenomena have been noticed as a practical matter in the particle 
filtering literature, where they are referred to as “sample impoverishment” P|, 
and others have well-understood analogs in population genetics, but we present
the first explanation we are aware of in the context of motion tracking. 



1.2 Markov Chains 

A Markov chain is a sequence of random variables $X_k$ with the property that
$P(X_n | X_1, \dots, X_{n-1}) = P(X_n | X_{n-1})$. One can think of this important property
as “forgetting”; the distribution for the next state of the chain depends only on
the current state and not on any other past state. The chain is referred to as a
homogeneous Markov chain if $P(X_n | X_{n-1})$ is independent of $n$.

If the random variables are discrete and have a countable state space, we can
write a matrix $\mathcal{P}$ called the state transition matrix whose $i, j$’th element is

$p_{ij} = P(X_n = j \mid X_{n-1} = i)$

Notice that $p_{ij} \ge 0$ and

$\sum_k p_{ik} = 1$

because the entries of $\mathcal{P}$ are probabilities. This matrix describes the state
transition process. In particular, assume that the random variable $X_{n-1}$ has
probability distribution $f$. Then the random variable $X_n$ has probability distribution
$g$, where

$g^T = f^T \mathcal{P}$

Now if the Markov chain is homogeneous, and $X_1$ has probability distribution
$h_1$, then $X_n$ has probability distribution $h_n$ where

$h_n^T = h_{n-1}^T \mathcal{P} = (h_{n-2}^T \mathcal{P}) \mathcal{P} = h_1^T \mathcal{P}^{n-1}$



If the random variables are defined on a continuous domain $D$, and have
probability density functions, then we can construct an operator analogous to
the state transition matrix. In particular, we consider a function $\mathcal{P}$ on $D \times D$
with the property that

$\mathcal{P}(u, v)\,dv = P(X_n \in [v, v + dv] \mid X_{n-1} \in [u, u + du])$

Notice that $\mathcal{P}(u, v) \ge 0$ for all $u$, $v$ and that

$\int_D \mathcal{P}(u, v)\,dv = 1$

Now if $X_{n-1}$ is a random variable with probability distribution function
$p(X_{n-1})$, then the probability distribution function of $X_n$ is

$\int_D \mathcal{P}(u, v)\,p(u)\,du$

A particularly interesting case occurs when the distribution of $X_n$ and that
of $X_{n-1}$ are the same. This distribution is known as a stationary distribution.
Markov chains on both discrete and continuous spaces can have stationary
distributions. On a discrete space, a stationary distribution $p$ has the property
that

$p^T \mathcal{P} = p^T$

and on a continuous space, a stationary distribution $p(u)$ has the property that

$\int_D \mathcal{P}(u, v)\,p(u)\,du = p(v)$

A Markov chain is not guaranteed to have a unique stationary distribution. 
In particular, there may be states, known as absorbing states, that the chain 
cannot leave. In this case, for each absorbing state the chain has a stationary 
distribution that places all weight on that absorbing state. 
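A small discrete example makes the role of absorbing states concrete. The three-state transition matrix below is hypothetical and chosen only for illustration; the helper `step` computes one application of the transition matrix to a distribution.

```python
def step(dist, P):
    """Return dist^T P for a row-stochastic matrix P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


# Hypothetical chain: state 2 is absorbing (its row is a unit vector).
P = [[0.5, 0.4, 0.1],
     [0.2, 0.6, 0.2],
     [0.0, 0.0, 1.0]]

h = [1.0, 0.0, 0.0]          # start in state 0 with certainty
for _ in range(200):
    h = step(h, P)
print(h)

# The distribution concentrated on the absorbing state is stationary.
print(step([0.0, 0.0, 1.0], P))
```

After enough transitions essentially all of the probability mass sits on the absorbing state, and that point mass is left unchanged by a further step.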



2 Sample Impoverishment in the Discrete Case 

If the process density in Condensation is such that the samples aren’t pertur- 
bed after the resampling step, then the state space is effectively discrete, since 
no points which weren’t in the original batch of samples will ever be introduced. 
We can regard the state space as a set of bins, and the samples as weighted 
balls placed in these bins. Each stage of the inference process (tracking, in most 
applications) moves these balls from bin to bin, then re-weights the balls based 
on the new observation. The probability that a ball will go into a particular bin 
at time $k + 1$ is proportional to the combined weight of the balls in that bin at
time k. As a result, once a bin is empty, it can never again contain a ball. Once 
all the balls lie in a particular bin (which is guaranteed to happen with non-zero 
probability), the representation is stuck in this state. 
2.1 A Two-State Problem

Assume that we have a state space that consists of two distinct points. This
could arise in tracking from the situation in figure 1. There is a stationary object
($p(x_t | x_{t-1}) = \delta_{x_{t-1}}(x_t)$) at $x = 1$, but due to the mirrors it appears on the image
plane at positions $-1$ and $1$. Similarly, if the object were at $x = -1$, it would appear
on the image plane at positions $-1$ and $1$. Thus there is no way to disambiguate
the positions $-1$ and $1$ (or more generally $-x$ and $x$) based on observations. We
can model this by making one observation $z$ at each stage at which we find,
exactly, either the object or its reflection, and using $p(z | x) = .5\,\delta_x(z) + .5\,\delta_{-x}(z)$.



Fig. 1. Geometry for the model of section 2.1. A stationary object is reflected in a
mirror, and is “tracked” — the tracker receives an exact measurement of the position 
of either the point or its reflection with equal probability. In that section, we show that 
a Condensation tracker will lose track of either this object or its mirror reflection in 
a time linear in the number of samples maintained. 

Start with a prior density $p_0(x) = .5\,\delta_1(x) + .5\,\delta_{-1}(x)$, indicating the point is
equally likely to be at positions $-1$ and $1$. We should then have $p(x_t | Z_t) = p_0(x)$
for each $t$ because the object is stationary and each observation $z_t$ is equally
consistent with the object being at $-1$ and $1$.

We now apply a Condensation tracker to estimate $p(x_t | Z_t)$. Let $N$ be the
number of samples used, and assume for convenience that $N$ is even. First note
that given our prior density and our process density, each of the sample points
will be either $-1$ or $1$. Given our observation density, each of these points
will have the same weight, $\pi_t^{(n)} = 1/N$. Thus we can characterize the sampled
representation at time $t$ by a single number, $X_t$, which is the number of samples
that are at $1$ (the remaining $N - X_t$ samples are at $-1$). $X_t$ represents the density

$\frac{X_t}{N}\,\delta_1(x) + \frac{N - X_t}{N}\,\delta_{-1}(x)$

so the sampled representation accurately represents the true density if $X_t = N/2$
for all $t$.
$X_t$ is a random variable, and the sequence $X_1, X_2, \dots$ is a homogeneous
Markov chain (because $\Pr(X_t | X_1, \dots, X_{t-1}) = \Pr(X_t | X_{t-1})$). This Markov
chain has the $N + 1$ states $0, \dots, N$ and, as the samples at time $t$ are constructed
by drawing $N$ points from the samples at time $t - 1$, with replacement and with
equal weight, we have

$\Pr(X_t = j \mid X_{t-1} = i) = \binom{N}{j} \left(\frac{i}{N}\right)^{j} \left(\frac{N - i}{N}\right)^{N - j}$
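The binomial transition probabilities can be tabulated directly. The sketch below assumes the two-mode setting of this section; the function name `resampling_matrix` is introduced here for illustration.

```python
from math import comb


def resampling_matrix(n):
    """Transition matrix of X_t for n equally weighted samples in two modes.

    Entry [i][j] is the probability of drawing j samples at 1 when
    i of the n current samples are at 1.
    """
    return [[comb(n, j) * (i / n) ** j * (1 - i / n) ** (n - j)
             for j in range(n + 1)]
            for i in range(n + 1)]


P = resampling_matrix(20)
print(max(abs(sum(row) - 1.0) for row in P))   # rows are probability vectors
print(P[0][0], P[20][20])                      # the two end states cannot be left
print(P[10][10])                               # chance of staying balanced in one step
```

The rows for states 0 and N are unit vectors, so once every sample sits in one mode the representation never changes again.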



Losing track of an item The dynamics of the Condensation algorithm
guarantee that, with probability one, the sequence of $X_t$’s for our example will
eventually be the constant sequence 0 or the constant sequence $N$. The states
0 and $N$ are absorbing. Since $P(i, 0) + P(i, N) \ge (1/2)^{N-1}$ for all $i$, the expected
number of steps before the chain is absorbed in one of these states is easily seen
to be no more than $2^{N-1}$. This is, however, a gross overestimate: the expected absorb
time is of the order of $N$ steps, as shall be addressed presently.



The time to lose track The matrix $\mathcal{P}$ is also the transition matrix for the
Wright-Fisher model of population genetics, and has been investigated since
the 1920’s. The samples at $-1$ and $1$ correspond to two kinds of alleles in a
population with fixed size, random mating, and non-overlapping generations. In
this context the expected absorb time corresponds to the number of generations
until one allele is lost entirely from the population as a result of “genetic drift.”

Let the vector $w$ correspond to having half the samples in each mode. It has been
shown that the $N + 1$ eigenvalues of $\mathcal{P}$ are $1, 1, (N - 1)/N, (N - 1)(N - 2)/N^2, \dots, (N - 1)!/N^{N-1}$.
Thus $\mathcal{P}$ has a basis of eigenvectors so we can write

$w = \pi + c_2 v_2 + \dots + c_N v_N$

^ “If a population is finite in size (as all populations are) and if a given pair of parents 
have only a small number of offspring, then even in the absence of all selective 
forces, the frequency of a gene will not be exactly reproduced in the next generation 
because of sampling error. If in a population of 1000 individuals the frequency of 
“a” is 0.5 in one generation, then it may by chance be 0.493 or 0.505 in the next
generation because of the chance production of a few more or less progeny of each
genotype. In the second generation, there is another sampling error based on the
new gene frequency, so the frequency of “a” may go from 0.505 to 0.501 or back
to 0.498. This process of random fluctuation continues generation after generation, 
with no force pushing the frequency back to its initial state because the population 
has no “genetic memory” of its state many generations ago. Each generation is an 
independent event. The final result of this random change in allele frequency is 
that the population eventually drifts to p=1 or p=0. After this point, no further
change is possible; the population has become homozygous. A different population, 
isolated from the first, also undergoes this random genetic drift, but it may become 
homozygous for allele “A” , whereas the first population has become homozygous for 
allele “a”. As time goes on, isolated populations diverge from each other, each losing 
heterozygosity. The variation originally present within populations now appears as 
variation between populations.” cn, p. 704 
where the $v_i$ are eigenvectors of the transition matrix $\mathcal{P}$ with eigenvalues
$\lambda_i$, and $\pi$ is a stationary distribution corresponding to a superposition of
absorbing states. Now after $k$ transitions of the chain, the probability distribution
on the state is

$w^T \mathcal{P}^k = \pi + c_2 \lambda_2^k v_2 + \dots + c_N \lambda_N^k v_N$

so

$|w^T \mathcal{P}^k - \pi| \le C \lambda_2^k$

where

$C = |c_2||v_2| + \dots + |c_N||v_N|$

So, we have that after $k$ steps

$|w^T \mathcal{P}^k - \pi| \le C \left(\frac{N - 1}{N}\right)^k$

where $\pi$ is a stationary distribution (all balls in one bin) with eigenvalue one.
Thus the second eigenvalue determines the asymptotic rate with which the
probability that the chain has been absorbed approaches 1, and also the rate at
which the variance of the representation that Condensation reports collapses,
as figure 2 illustrates.




Fig. 2. On the left, the probability that all $N$ samples belong to the same mode after
$k$ steps, graphed against $k$, assuming that the chain is started with $N/2$ samples from
each mode. There are two curves, one for $N = 20$ and one for $N = 40$. On the right,
a graph showing the variance of the posterior density estimated by Condensation
from the $N$ samples after $k$ steps. The correct variance at any stage is known to be 1.
The estimated variance goes down, because the samples collapse to one mode. However,
comparing these estimates from step to step would suggest that the estimate was good.
There are two curves, one for $N = 20$ and one for $N = 40$. All graphs were computed
numerically from powers of $\mathcal{P}$.
Fig. 3. We have that $|w^T \mathcal{P}^k - \pi| \le C((N - 1)/N)^k$, where $\pi$ is a stable distribution
(all balls in one bin) with eigenvalue one. To obtain a useful finite time bound we need
to know something about $C$. Above is a graph of $C$ as a function of $N$, computed
numerically; the graph suggests that $C$ grows no faster than linearly in $N$.



While $\lambda_2$ tells us something about the asymptotic convergence to $\pi$, for this
to be a useful finite time bound we need to know something about $C$. Figure 3
shows a graph of $C$ as a function of $N$, computed numerically. If, as the figure
suggests, $C$ is a sublinear function of $N$, then since $(1 - 1/N)^N < 1/e$,

$|w^T \mathcal{P}^{aN} - \pi| \le C \left(1 - \frac{1}{N}\right)^{aN} < C e^{-a}$

This means that an arbitrarily small probability that the chain is out of an
absorbing state can be obtained in a number of steps of order $N \log C$ where
$C < N$. Direct computations suggest that the factor of $\log C$ can be dispensed
with entirely (figure 4).

A more natural measure of convergence may be the expected number of steps
to reach an absorbing state. Although an exact formula for finite $N$ remains
elusive, it is known, by using a continuous time approximation, that
as the number of samples $N$ goes to infinity, the expected time to be absorbed
when starting with $j$ samples at $1$ and $N - j$ samples at $-1$ is asymptotically

$-2N \left[ \frac{j}{N} \ln \frac{j}{N} + \frac{N - j}{N} \ln \frac{N - j}{N} \right]$

In particular, when starting with $N/2$ samples at $-1$ and at $1$, the expected
time to be absorbed is asymptotically $(2 \ln 2)N \approx 1.4N$. This is a good
approximation of the expected absorb time for small $N$ as well, as demonstrated in
figure 4.
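For small $N$ the expected absorb time can be computed exactly by solving a linear system over the transient states and compared with the asymptotic expression. The sketch below is one way to do this under the two-mode model of this section; the function names and the plain Gauss-Jordan solver are illustrative choices, not the computation used for the figure.

```python
from math import comb, log


def resampling_matrix(n):
    """Two-mode resampling (Wright-Fisher) transition matrix."""
    return [[comb(n, j) * (i / n) ** j * (1 - i / n) ** (n - j)
             for j in range(n + 1)]
            for i in range(n + 1)]


def expected_absorb_times(n):
    """Solve (I - Q) t = 1 over the transient states 1..n-1."""
    P = resampling_matrix(n)
    m = n - 1
    # Augmented matrix [I - Q | 1] restricted to the transient states.
    A = [[(1.0 if r == c else 0.0) - P[r + 1][c + 1] for c in range(m)] + [1.0]
         for r in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(m):
            if r != col:
                factor = A[r][col] / A[col][col]
                for c in range(col, m + 1):
                    A[r][c] -= factor * A[col][c]
    return [0.0] + [A[r][m] / A[r][r] for r in range(m)] + [0.0]


def asymptotic_absorb_time(n, j):
    """Continuous-time approximation for a start with j samples at 1."""
    p = j / n
    return -2.0 * n * (p * log(p) + (1 - p) * log(1 - p))


for n in (10, 20, 40):
    exact = expected_absorb_times(n)[n // 2]
    print(n, exact, asymptotic_absorb_time(n, n // 2))
```

The entry at index `j` of the returned list is the expected number of resampling steps before all samples share one mode when starting with `j` samples at 1.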

Note that while this linear absorb time may not overly concern population 
geneticists, for whom the time between the generations of interest may be
many years, in Condensation there are perhaps 30 “generations” per second 
— one for each frame of video — so with a hundred samples, modes may be lost 
after only a few seconds. 
Fig. 4. The time to lose a mode is roughly linear in the number of samples. On the left,
a graph of the number of steps required for the probability that all samples belong to
the same mode to exceed 50%, 90%, and 99%, as a function of the number of samples
$N$, obtained by evaluating $w^T \mathcal{P}^k$. On the right, a graph of the expected absorb time for
$N = 2, 4, \dots, 50$ samples when starting with half the samples in each mode, computed
numerically. The solid line is the asymptotic expected absorb time $(2 \ln 2)N$.



2.2 Multiple States 

We can consider a similar process with three, four, or in general $b$ bins, instead
of two. With $b$ bins and $N$ samples, the number of states in the Markov chain
is $\binom{N + b - 1}{b - 1}$, corresponding to the number of distinct $b$-tuples $(i_1, \dots, i_b)$ with
nonnegative integer coordinates summing to $N$. The transition probability is

$P\big((i_1, \dots, i_b) \to (j_1, \dots, j_b)\big) = \frac{N!}{j_1! \cdots j_b!} \prod_{m=1}^{b} \left(\frac{i_m}{N}\right)^{j_m}$

The eigenvalues of the above $P$ are

$1, (N - 1)/N, (N - 1)(N - 2)/N^2, \dots, (N - 1)!/N^{N-1}$

with multiplicities

$b, \dots, \binom{b + N - 2}{N}$

respectively.

Since these are the same eigenvalues as for the two-bin case (though with
different multiplicities) we may expect the same qualitative asymptotic behavior.
Simulation results bear out this view (figure 5).

The expected time for all the samples to collapse into a single bin when
starting with $X_i$ samples from the $i$-th bin is asymptotically

$-2N \sum_{i=1}^{b} \frac{N - X_i}{N} \ln \frac{N - X_i}{N}$
N 
In particular, starting with $N/b$ samples in each bin, the mean absorb time is

$2N(b - 1) \ln\left(\frac{b}{b - 1}\right)$.

For proofs, and further analysis of the multi-bin case, see [^. 
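The state count and the asymptotic absorb-time expressions for several bins can be cross-checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the brute-force enumeration is practical only for small $N$ and $b$.

```python
from itertools import product
from math import comb, log


def state_count(n, b):
    """Number of b-tuples of nonnegative integers summing to n."""
    return comb(n + b - 1, b - 1)


def brute_force_state_count(n, b):
    """Same count by direct enumeration (small n and b only)."""
    return sum(1 for t in product(range(n + 1), repeat=b) if sum(t) == n)


def mean_absorb_time(counts):
    """Asymptotic mean time to collapse, given the samples per bin."""
    n = sum(counts)
    return -2.0 * n * sum((1 - x / n) * log(1 - x / n) for x in counts if x < n)


print(state_count(12, 3), brute_force_state_count(12, 3))
print(mean_absorb_time([6, 6]), 2 * log(2) * 12)            # two-bin case
print(mean_absorb_time([4, 4, 4]), 2 * 12 * 2 * log(3 / 2))  # equal start, b = 3
```

With equal counts in every bin the general sum reduces to the closed form $2N(b - 1)\ln(b/(b - 1))$, and with two bins it reduces to $(2 \ln 2)N$.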




Fig. 5. When repeatedly resampling from a density with b discrete modes, the number
of modes represented by the samples is nonincreasing, and the representation moves
rapidly to an absorbing state where all samples lie in a single mode. The figure shows
results from a chain using twelve samples, obtained by computing powers of P. The
graph shows the probability that all twelve samples are from the same mode as a
function of the number of transitions k, after starting with an equal number of samples
from each mode. The three curves, from left to right, are for 2, 3, and 4 modes.



If we start with an equal number of samples from each mode, considerations
of symmetry dictate an equal probability of all the samples ending up in either
mode; more is possible: As $E(X_{k+1} \mid X_k) = X_k$, the sequence of $X_k$’s forms a
martingale. Then by the optional stopping theorem cni, the probability that all 
the samples end up in a given mode is proportional to the number of samples 
that started in that mode. An analogous result holds when there are more than 
two modes, as can be seen by considering one mode in opposition with the rest 
of the modes combined. A consequence of this fact is that if a spurious mode can 
start with a non-negligible fraction of the samples, it has a similarly non-negligible
probability of usurping all the samples and suppressing the true mode. 
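
The proportionality claim can be illustrated numerically. The sketch below is not from the paper; it assumes equal sample weights, runs the two-mode resampling chain to absorption many times, and reports how often the mode that started with a given number of samples ends up owning all of them. Names, seed and trial count are illustrative.

```python
import random


def ends_in_mode_a(n_samples, start_in_a, rng):
    """Run unweighted resampling until absorption; True if mode A takes every sample."""
    in_a = start_in_a
    while 0 < in_a < n_samples:
        p = in_a / n_samples
        in_a = sum(1 for _ in range(n_samples) if rng.random() < p)
    return in_a == n_samples


def takeover_frequency(n_samples, start_in_a, trials=4000, seed=0):
    """Fraction of runs in which mode A absorbs all samples."""
    rng = random.Random(seed)
    wins = sum(ends_in_mode_a(n_samples, start_in_a, rng) for _ in range(trials))
    return wins / trials


if __name__ == "__main__":
    # A spurious mode holding 5 of 20 samples should win about a quarter of the time.
    print(takeover_frequency(20, 5))
```

A spurious mode that starts with a quarter of the samples should therefore suppress the true mode in about a quarter of the runs, up to sampling noise.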

3 Bad Behaviour in Continuous Spaces 

The above examples were discrete to make the analysis more straightforward, 
but the same phenomena are visible in the continuous case. Slightly perturbing 
the samples will make them distinct, so that the Markov chain will no longer 
have an absorbing state, and the observations may vary, so that the chain will 
no longer be homogeneous. However, there will still be bad behaviour. 

Recall that the chain’s domain is sampled representations of probability densities.
If a set of samples are clustered together closely, resampling these samples
will tend to produce a cluster that is near the original cluster.
Two phenomena appear: Firstly, modes are lost (section 3.1), as in the discrete
example; secondly, the algorithm can produce data that strongly suggest
a mode is being tracked, even though this isn’t happening (section 3.2). We
show two examples that illustrate the effects. The examples are on a compact
domain; all gaussian process densities are windowed to have support extending
2 standard deviations to either side of the mode, and particles are reflected off
boundaries at -2 and 2 (-1 and 1 for the uniform observation density examples
of section 3.2).

3.1 Losing Modes while Tracking Almost Stationary Points 

We simulated a system with gaussian diffusion, in the same setup as in figure Q 
We used the process density

$$P(x_t \mid x_{t-1}) \propto \exp\left(-(x_t - x_{t-1})^2 / 2(.05)^2\right)$$

and observation density

$$P(z \mid x) \propto \exp\left(-(x - z)^2 / 2\sigma^2\right) + \exp\left(-(x + z)^2 / 2\sigma^2\right)$$

The observation density represents a slightly defocussed observation which still
anticipates finding the object and its mirror image. Various values of the parameter
$\sigma$ were employed, including $\sigma = \infty$, in which the observation density is
uniform and observations are consequently completely uninformative.

The motion of an object starting at $x_0 = 1$ and undergoing a gaussian random
walk was simulated; defocussed observations of this data were simulated to give
a set of measurements to run the algorithm. The Condensation algorithm was
simulated for 50 steps on the sequence of observations using 100 samples, with
diffusion and observation densities the same as those used to generate the point
positions and observations, so any misbehavior is intrinsic to Condensation
itself, and not due to a bad estimate of the dynamics. Condensation was
initialized with half the samples at -1 and half at 1. As figure 6 indicates, modes
are lost quite quickly. The loss of these modes results in a fall in the variance of
the representation of the posterior (figure 6).

To continue the analogy with genetics, the gaussian diffusion plays a role akin 
to mutation, in preserving diversity among the samples. But the diffusion is small 
relative to the separation between the modes, so the diffusion of samples from one 
mode to the other would take many steps. Since the perilously small observation 
density puts points between the modes at a severe selective disadvantage, such 
a journey is highly improbable. Consequently, all the samples tend to cluster in 
a neighborhood of one of the modes. 
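
The experiment of this section can be approximated in a few lines. The following is a rough sketch under stated assumptions rather than the authors' code: the measurement at each step is taken to be the true position itself, the observation width `sigma_obs=0.5` is an arbitrary illustrative value, and the resample, diffuse, reweight ordering is one common way of arranging a Condensation step. The windowing at two standard deviations and the reflection at -2 and 2 follow the description in section 3.

```python
import math
import random


def windowed_gauss(rng, sigma):
    """Gaussian step truncated to two standard deviations (by rejection)."""
    while True:
        step = rng.gauss(0.0, sigma)
        if abs(step) <= 2.0 * sigma:
            return step


def reflect(x, lo=-2.0, hi=2.0):
    """Reflect a particle off the boundaries of the compact domain."""
    if x > hi:
        return 2.0 * hi - x
    if x < lo:
        return 2.0 * lo - x
    return x


def observation_weight(x, z, sigma_obs):
    """Two-lobed density exp(-(x-z)^2/2s^2) + exp(-(x+z)^2/2s^2)."""
    if math.isinf(sigma_obs):
        return 1.0  # uniform, completely uninformative observation
    s2 = 2.0 * sigma_obs ** 2
    return math.exp(-(x - z) ** 2 / s2) + math.exp(-(x + z) ** 2 / s2)


def run_condensation(n_samples=100, n_steps=50, sigma_obs=0.5, seed=0):
    """Return the final particles and the weighted variance at each step."""
    rng = random.Random(seed)
    truth = 1.0
    particles = [-1.0] * (n_samples // 2) + [1.0] * (n_samples - n_samples // 2)
    weights = [1.0] * n_samples
    variances = []
    for _ in range(n_steps):
        truth = reflect(truth + windowed_gauss(rng, 0.05))
        z = truth  # assumption: the measurement is the true position itself
        # Resample by weight, diffuse, then reweight with the new observation.
        particles = rng.choices(particles, weights=weights, k=n_samples)
        particles = [reflect(x + windowed_gauss(rng, 0.05)) for x in particles]
        weights = [observation_weight(x, z, sigma_obs) for x in particles]
        total = sum(weights)
        mean = sum(w * x for w, x in zip(weights, particles)) / total
        variances.append(
            sum(w * (x - mean) ** 2 for w, x in zip(weights, particles)) / total)
    return particles, variances


if __name__ == "__main__":
    final, var = run_condensation(seed=1)
    print(round(var[0], 3), round(var[-1], 3))
    print(sum(1 for x in final if x < 0), "samples left near the mode at -1")
```

Plotting `variances` for a few seeds is a quick way to look for the collapse described above; a variance near 1 indicates that both modes are still represented, while a value near 0 indicates that one has been lost.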

3.2 Gaining Modes without Measurements 

The state space is now the interval [-1, 1]; we supply a small diffusion process,
and use a uniform observation density, i.e. there is no information at all about
where objects are. In this case, if the samples representing the posterior at step
k are close together, then they will almost certainly be close together at step
k + 1 (because the probability of a sample moving a long distance with a small
diffusion process is very small). This suggests (but does not prove) that in this
case the Condensation algorithm yields a Markov chain whose stationary
distribution has most of its weight on “clumped” representations.

Fig. 6. On the top line, we show the Condensation representation of the prior density
for time k + 1 for the example of section 3.1, where the posterior has two modes, plotted
for two sample runs, using 100 samples. In the first run the mode near -1 is lost after
around 25 steps, and in the second run the mode near 1 is lost after around 15 steps.
The bottom left figure shows the variance computed for the representation at each time
step for the two runs; this variance declines because the samples collapse to one or
another of the modes. The bottom right figure graphs the means computed from the
weighted samples at each time step. Notice that for a single run, the mean estimate
looks rather good (it hardly changes from stage to stage) but across runs, it looks bad
(section 4).

The effect is very noticeable in a simulation. The Condensation algorithm
was simulated for 100 steps using 20 samples, with diffusion given by
$P(x_t \mid x_{t-1}) \propto \exp(-(x_t - x_{t-1})^2 / 2(.05)^2)$, observation density $P(z \mid x)$ uniform
on [-1, 1], and initial distribution uniform on [-1, 1], yielding the results of
figure 7. Notice that quite tight clumps of samples appear (in different places in
each run) suggesting quite falsely that the tracker is actually tracking something.
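
The clumping effect can be reproduced with a short simulation. This is an illustrative sketch, not the authors' code: because the observation density is uniform, every sample has the same weight and resampling reduces to drawing with replacement. The windowing at two standard deviations and the reflection at -1 and 1 follow section 3; the function names and seeds are arbitrary.

```python
import random


def clumping_run(n_samples=20, n_steps=100, sigma=0.05, seed=0):
    """Resample and diffuse with no measurement information; return final samples."""
    rng = random.Random(seed)
    particles = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    for _ in range(n_steps):
        # Uniform observation density: all weights equal, so resample uniformly.
        particles = [rng.choice(particles) for _ in range(n_samples)]
        moved = []
        for x in particles:
            step = rng.gauss(0.0, sigma)
            while abs(step) > 2.0 * sigma:  # window at two standard deviations
                step = rng.gauss(0.0, sigma)
            x += step
            if x > 1.0:  # reflect off the boundaries of [-1, 1]
                x = 2.0 - x
            elif x < -1.0:
                x = -2.0 - x
            moved.append(x)
        particles = moved
    return particles


def sample_variance(xs):
    """Population variance of a list of sample positions."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


if __name__ == "__main__":
    for seed in range(3):
        run = clumping_run(seed=seed)
        print(round(sum(run) / len(run), 2), round(sample_variance(run), 3))
```

A uniform density on [-1, 1] has variance 1/3, so sample variances far below that value, with means that differ from seed to seed, are the signature of the spurious clumps.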
Fig. 7. The Condensation algorithm was simulated for 100 steps using 20 samples,
on the problem described in section 3.2, where no information is available about the
position of an object and so the posterior should be uniform at every step. The plots
above show the density estimated from the samples at each stage t, in two trial runs.
Note that the probability quickly becomes concentrated, though it should remain
uniformly distributed, and that the region in which it becomes concentrated differs from
run to run; this effect makes Condensation experiments very difficult to evaluate
(section 4).



4 Discussion 

While a representation of a posterior may use N samples, these samples may 
not be very effectively deployed (for example, if all are clumped close together). 
The resampling phase in Condensation guarantees that there is a non-zero 
probability of losing modes and a very small probability of regaining them. This 
means that the effective number of samples goes down with each resampling 
step. Generally, this effect is governed by the dynamics of the Markov chain; 
unusually, one wishes the chain not to burn in (and perhaps reach an absorbing 
state). This means that to be able to use Condensation with a finite number 
of samples and a guarantee of the quality of the results, one must be able to 
bound the second eigenvalue of the Markov chain below. This is sometimes as 
difficult as bounding it above, which is required to guarantee good behaviour 
from Markov chain Monte Carlo (e.g. Ha)- 

Nothing here should be read as a suggestion that Condensation not be 
used, just that, like other sampling algorithms, it should be used very carefully. 
Generally, bounds will not be available, so that the algorithm’s usefulness
depends on designing experiments to take into account the possible effects of sample
impoverishment.

Sample impoverishment makes experiments difficult to evaluate, because very 
poor estimates of a posterior may look like very good estimates. Representations 
of a probability distribution are mainly used to compute expectations (e.g. the 
center of gravity of a tracked object, etc.) as a weighted sum over samples. A 
standard technique for checking the quality of an estimate of an expectation is 
to look at the variance of these estimates. Now assume that we are tracking a 
stationary object with Condensation; the estimate of the object’s center of 
Fig. 8. The top two plots below show the mean of the samples at each time step for
the two distinct trial runs for the problem of section 3.2. (Since the density should be
uniform at each stage, the mean should be 0.) The bottom two plots below show the
variance of the samples at each time step for the two trial runs. (Since the density
should be uniform on [-1, 1] at each stage, the variance should be 1/3.) The tight clusters and
slow drift of the clusters make it look like there’s an object being tracked even though
we have no idea whatsoever where the object is.



gravity from frame to frame may have low variance, while the actual variance 
— which is obtained by looking at the estimates obtained by starting the algo- 
rithm in different places — is high (c.f. the last three sentences of the second 
footnote). This effect appears in both the discrete (figure|21) and continuous cases 
(figures |3 Hj) . 

Another approach to combating this effect is to use fewer resampling steps 
(as in the SIS/ SIR algorithm of P, where an estimate of the effective number 
of samples is used); this probably involves using a more heavily constrained 
dynamical model so that fewer resampling steps are required. Finally, one might 
generate new samples occasionally. 
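
One way to use fewer resampling steps is to resample only when an estimate of the effective number of samples becomes small. The sketch below is an assumption-laden illustration, not the method of the cited report: it uses the estimate 1 / sum(w_i^2) over normalised weights, which is a common choice in the sequential Monte Carlo literature, and a hypothetical threshold of half the sample count.

```python
def effective_sample_size(weights):
    """Estimate 1 / sum(w_i^2) for the normalised weights w_i."""
    total = sum(weights)
    normalised = [w / total for w in weights]
    return 1.0 / sum(w * w for w in normalised)


def should_resample(weights, threshold_fraction=0.5):
    """Resample only when the estimate drops below a fraction of N (hypothetical rule)."""
    return effective_sample_size(weights) < threshold_fraction * len(weights)


if __name__ == "__main__":
    print(effective_sample_size([1.0, 1.0, 1.0, 1.0]))  # evenly spread weights
    print(effective_sample_size([1.0, 0.0, 0.0, 0.0]))  # one sample dominates
```

Equal weights give an estimate equal to the number of samples, while a single dominant weight drives it towards one, so the rule resamples only when the weights have degenerated.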

5 Acknowledgements 

Thanks to the referees for their helpful suggestions. 






References 

1. A. Blake and M. Isard. Condensation - conditional density propagation for visual
tracking. Int. J. Computer Vision, 29(1):5-28, 1998.

2. B.P. Carlin and T.A. Louis. Bayes and empirical Bayes methods for data analysis.
Chapman and Hall, 1996.

3. J. Carpenter, P. Clifford, and P. Fearnhead. Improved particle filter for non-linear
problems. IEEE Proc. Radar, Sonar and Navigation, 146(1):2-7, 1999.

4. A. Doucet. On sequential simulation-based methods for Bayesian filtering.
Technical report, Cambridge University, 1998. CUED/F-INFENG/TR310.

5. W.J. Ewens. Population Genetics. Methuen, 1969.

6. W.J. Ewens. Mathematical Population Genetics. Springer, 1979.

7. A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin. Bayesian Data Analysis.
Chapman and Hall, 1995.

8. K. Kanazawa, D. Koller, and S. Russell. Stochastic simulation algorithms for
dynamic probabilistic networks. In Proc. Uncertainty in AI, 1995.

9. J.S. Liu and R. Chen. Sequential Monte Carlo methods for dynamic systems.
Technical report, Stanford University, 1999. Preprint.

10. J.R. Norris. Markov Chains. Cambridge University Press, 1997.

11. D.T. Suzuki, A.J.F. Griffiths, J.H. Miller, and R.G. Lewontin. An Introduction to
Genetic Analysis. W.H. Freeman, 1989.

12. L. Tierney. Introduction to general state-space Markov chain theory. In W.R.
Gilks, S. Richardson, and D.J. Spiegelhalter, editors, Markov chain Monte Carlo
in practice. Chapman and Hall, 1996.




On the Structure and Properties of the 
Quadrifocal Tensor 



Amnon Shashua and Lior Wolf 



School of Computer Science and Engineering, 
The Hebrew University, 

Jerusalem 91904, Israel 
e-mail: {shashua, lwolf}@cs.huji.ac.il



Abstract. The quadrifocal tensor which connects image measurements
along 4 views is not yet as well understood as its counterparts, the fundamental
matrix and the trifocal tensor. This paper establishes the structure
of the tensor as an “epipole-homography” pairing

$$Q^{ijkl} = v'^j H^{ikl} - v''^k H^{ijl} + v'''^l H^{ijk}$$

where $v', v'', v'''$ are the epipoles in views 2,3,4, H is the “homography
tensor”, the 3-view analogue of the homography matrix, and the indices
i,j,k,l are attached to views 1,2,3,4 respectively, i.e., $H^{ikl}$ is the
homography tensor of views 1,3,4.

In the course of deriving the structure we show that Linear Line
Complex (LLC) mappings are the basic building block in the process. We
also introduce a complete break-down of the tensor slices: 3x3x3 slices
are homography tensors, and 3x3 slices are LLC mappings. Furthermore,
we present a closed-form formula of the quadrifocal tensor described by
the trifocal tensor and fundamental matrix, and also show how to recover
projection matrices from the quadrifocal tensor. We also describe the
form of the 51 non-linear constraints a quadrifocal tensor must adhere
to.

1 Introduction 

The study of the geometry of multiple views has revealed the existence of certain 
multi-linear forms that connect image measurements of points and lines across 
2,3,4 views. The coefficients of these multi-linear forms make up the fundamental 
matrix ism in the case of two views, the trifocal tensor mnm of three views, 
and the quadrifocal tensor in case of four views [r2llbl7t-{j . 

Among the three, the quadrifocal tensor is the least understood. What is 
known so far is somewhat fragmented and includes (i) the existence of 16 quad- 
linear forms per quadruple of matching points across 4 views 
(ii) quadlinear forms between a pair of matching quadruples form a rank 31 sy- 
stem ^5) (iii) the fact that the coefficients of the tensor are Grassman coordinates 
[bil I l)j . (iv) that the quadlinear forms are spanned by trilinear and bilinear forms 
0 (using, however, symbolic algebra and random camera configurations), (v) 

D. Vernon (Ed.): ECCV 2000, LNCS 1842, pp. 710 4773 2000. 

(c) Springer- Verlag Berlin Heidelberg 2000 
3x3 slices of the quadrifocal tensor are rank-2 matrices |3, and (vi) equations
for recovering fundamental matrices and trifocal tensors from the quadrifocal 
tensor Q, albeit not as far as recovering the camera projection matrices in a 
general manner (only for ’’reduced” representations IhlhU l. and (vii) the source 
of the 51 non-linear constraints and their form is still an open issue. 

In this paper we derive the quadrifocal tensor ’’bottom up” similarly to the 
way the trifocal tensor was derived in mm and establish the following results. 
First and foremost we obtain an explicit formula that describes the tensor as a
sum of epipole-homography outer-products:

$$Q^{ijkl} = v'^j H^{ikl} - v''^k H^{ijl} + v'''^l H^{ijk}$$

where $v', v'', v'''$ are the epipoles in views 2,3,4, H is the “homography tensor”, the
3-view analogue of the homography matrix, and the indices i,j,k,l are attached
to views 1,2,3,4 respectively, i.e., $H^{ikl}$ is the homography tensor of views 1,3,4.

In the course of deriving the representation above we find that the Linear Line 
Complex (LLC) mapping (introduced in the past in the context of ambiguities 
in reconstruction HS|) forms a basic building block in the construction of the 
tensor. The explicit representation allows us to introduce a complete break-down 
of the tensor slices: 3x3x3 slices are homography tensors, and 3x3 slices are 
LLC mappings. Furthermore, we present a closed-form formula of the quadrifocal 
tensor described by the trifocal tensor and fundamental matrix, and also show 
how to generally recover projection matrices from the quadrifocal tensor. Finally, 
we describe the form of the 51 non-linear constraints a quadrifocal tensor must 
adhere to. 

2 Notations and Background 

We will be working with the projective 3D space and the projective plane. In
this section we will describe the basic elements we will be working with: (i) collineations
of the plane, (ii) camera projection matrices, (iii) Linear Line Complex
(LLC) mapping, and (iv) tensor notations.

A point in the projective plane $\mathcal{P}^2$ is defined by three numbers, not all zero, that form a coordinate
vector defined up to a scale factor. The dual projective plane represents the
space of lines which are also defined by a triplet of numbers. A point p in the
projective plane coincides with a line s if and only if $p^\top s = 0$, i.e., the scalar
product vanishes. In the projective plane any four points in general position can
be uniquely mapped to any other four points in general position. Such a
mapping is called a collineation and is defined by a 3 x 3 invertible matrix, defined
up to scale. These matrices are sometimes referred to as homographies. If H is a
homography matrix, then $H^{-\top}$ (inverse transpose) is the dual homography that
maps lines onto lines.
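
The relation between a homography and its dual can be checked numerically. The sketch below is illustrative only: it uses the fact that the cofactor matrix of a 3 x 3 matrix H equals det(H) times the inverse transpose, which is enough because lines are defined up to scale. The helper names and the example matrix in the usage section are arbitrary.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def dot(a, b):
    """Scalar product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def apply(h, v):
    """Matrix-vector product for a 3 x 3 matrix stored as a list of rows."""
    return [dot(row, v) for row in h]


def dual_homography(h):
    """Cofactor matrix of H, equal to det(H) times the inverse transpose of H."""
    r0, r1, r2 = h
    return [cross(r1, r2), cross(r2, r0), cross(r0, r1)]


if __name__ == "__main__":
    H = [[2, 0, 1], [1, 1, 0], [0, 3, 1]]      # arbitrary invertible matrix
    p = [1, 2, 1]
    s = cross(p, [3, 0, 1])                    # a line through p
    print(dot(p, s))                           # incidence before mapping
    print(dot(apply(H, p), apply(dual_homography(H), s)))  # and after
```

Both printed scalar products vanish, showing that a point on a line stays on the mapped line when points are mapped by H and lines by its dual.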

A point in the projective space $\mathcal{P}^3$ is defined by four numbers, not all zero, that form a coordinate
vector defined up to a scale factor. The dual projective space represents the space
of planes which are also defined by a quadruple of numbers. The projection from
3D space to 2D space is determined by a 3 x 4 matrix. A useful parameterization
(which is the one we adopt in this paper) is to have the 3D coordinate frame and
the 2D coordinate frame of view 1 aligned. Thus, in the case of two views we let
$[I; 0]$, $[A; v']$ be the two camera matrices from 3D space to views 1,2 respectively.
The matrix A is a homography matrix from view 1 to 2 whose corresponding
plane is the “reference” plane, and v' is the intersection of view 2 with the line joining
the two camera centers (the null spaces of the respective projection matrices), known
as the epipole. Additional views, $[B; v'']$, $[C; v''']$, etc., must all agree on the same
reference plane. In other words, the homographies A, B, C, ... all form a group,
i.e., the homography matrix from view 2 to 3 is $BA^{-1}$, for example. Note that
the choice of the reference plane is free, a fact that provides 3 free parameters
(the “gauge” of the system) when setting up a set of projection matrices that
agree with the image measurements. If $A_\pi, A_{\bar\pi}$ are two homography matrices
between views 1,2 associated with planes $\pi, \bar\pi$, then $A_{\bar\pi} = A_\pi + v'w^\top$ where w
is the projection onto view 1 of the line intersection of $\pi$ and $\bar\pi$. If $\lambda$ is a line in
view 3, for example, then $\lambda^\top[B; v'']$ is the plane passing through the line $\lambda$ and
the third camera center.

Another useful transformation between views of a 3D scene is the one resulting
from a Linear Line Complex (LLC) configuration. An LLC configuration
consists of a set of lines in 3D that have a common line intersection (referred to
as the kernel of the set). Let L be the kernel of the set, and let its projections
onto views 1,2 be $l, l'$ respectively (see Fig. 1). Let A be a homography matrix of
any plane containing L, then $G = A[l]_\times$ (where $[\cdot]_\times$ is the skew-symmetric matrix
of cross-products, see later) is a unique transformation (does not depend on the
choice of the plane) that satisfies $s'^\top A[l]_\times s = 0$ for all matching lines $s, s'$ in
views 1,2 respectively arising from lines of the LLC configuration. The right and
left null spaces of G are the projections $l, l'$ of L on views 1,2 respectively.

It will be most convenient to use tensor notations from now on because the
material we will be using in this paper involves coupling together pairs of collineations
and epipoles into a “joint” object. When working with tensor objects the
distinction of when coordinate vectors stand for points or lines matters. A point is
an object whose coordinates are specified with superscripts, i.e., $p^i = (p^1, p^2, p^3)$.
These are called contravariant vectors. A line in $\mathcal{P}^2$ is called a covariant vector
and is represented by subscripts, i.e., $s_j = (s_1, s_2, s_3)$. Indices repeated in covariant
and contravariant forms are summed over, i.e., $p^i s_i = p^1 s_1 + p^2 s_2 + p^3 s_3$.
This is known as a contraction. For example, if p is a point incident to a line s
in $\mathcal{P}^2$, then $p^i s_i = 0$.

Vectors are also called 1-valence tensors. 2-valence tensors (matrices) have
two indices and the transformation they represent depends on the covariant-
contravariant positioning of the indices. For example, $a_i^j$ is a mapping from
points to points (a collineation, for example), and hyperplanes (lines in $\mathcal{P}^2$)
to hyperplanes, because $a_i^j p^i = q^j$ and $a_i^j s_j = r_i$ (in matrix form: $Ap = q$
and $A^\top s = r$); $a_{ij}$ maps points to hyperplanes; and $a^{ij}$ maps hyperplanes to
points. When viewed as a matrix the row and column positions are determined
accordingly: in $a_i^j$ and $a_{ji}$ the index i runs over the columns and j runs over
the rows, thus $b_j^k a_i^j = c_i^k$ is $BA = C$ in matrix form. An outer-product of







Fig. 1. A linear line complex (LLC) is a configuration of lines in 3D that have a common
line intersection, the kernel, L. Let l be the projection of L in view 1 and let A be a
homography of some plane containing L. Then $G = A[l]_\times$ is the LLC mapping that satisfies
$s'^\top G s = 0$ for all matching lines s, s' in views 1,2 respectively arising from lines of the
LLC configuration.



two 1-valence tensors (vectors), $a_i b^j$, is a 2-valence tensor $c_i^j$ whose i,j entries
are $a_i b^j$; note that in matrix form $C = ba^\top$. A 3-valence tensor has three
indices, say $H_i^{jk}$. The positioning of the indices reveals the geometric nature of
the mapping: for example, $p^i s_j H_i^{jk}$ must be a point because the i,j indices drop
out in the contraction process and we are left with a contravariant vector (the
index k is a superscript). Thus, $H_i^{jk}$ maps a point in the first coordinate frame
and a line in the second coordinate frame into a point in the third coordinate
frame. The trifocal tensor in multiple-view geometry is an example of such a
tensor. A single contraction, say $p^i H_i^{jk}$, of a 3-valence tensor leaves us with a
matrix. Note that when p is (1, 0, 0) or (0, 1, 0), or (0, 0, 1) the result is a “slice”
of the tensor.

We will make extensive use of the “cross-product tensor” $\epsilon$ defined next.
The cross product (vector product) operation $c = a \times b$ is defined for vectors in
$\mathcal{P}^2$. The product operation can also be represented as the product $c = [a]_\times b$
where $[a]_\times$ is called the “skew-symmetric matrix of a”. In tensor form we have
$\epsilon_{ijk} a^i b^j = c_k$ representing the cross product of two points (contravariant vectors)
resulting in the line (covariant vector) $c_k$. Similarly, $\epsilon^{ijk} a_i b_j = c^k$ represents
the point intersection of the two lines $a_i$ and $b_j$. The tensor $\epsilon$ is the antisymmetric
tensor defined such that $\epsilon_{ijk} a^i b^j c^k$ is the determinant of the 3 x 3
matrix whose columns are the vectors a, b, c. As such, $\epsilon_{ijk}$ contains 0, +1, -1
where the vanishing entries correspond to arrangements of indices with repetitions
(21 such entries), whereas the odd permutations of ijk correspond to -1 entries
and the even permutations to +1 entries.
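
The properties of the cross-product tensor listed above can be verified by building it explicitly. The following sketch is illustrative; indices run over 0, 1, 2 rather than 1, 2, 3, and the function names are hypothetical.

```python
from itertools import permutations


def levi_civita():
    """3 x 3 x 3 antisymmetric tensor with entries 0, +1, -1."""
    eps = [[[0] * 3 for _ in range(3)] for _ in range(3)]
    for i, j, k in permutations(range(3)):
        # Parity of the permutation (i, j, k) of (0, 1, 2).
        inversions = (i > j) + (i > k) + (j > k)
        eps[i][j][k] = -1 if inversions % 2 else 1
    return eps


def cross_via_eps(a, b):
    """c_k = eps_ijk a^i b^j, the cross product written as a contraction."""
    eps = levi_civita()
    return [sum(eps[i][j][k] * a[i] * b[j] for i in range(3) for j in range(3))
            for k in range(3)]


def det_via_eps(a, b, c):
    """eps_ijk a^i b^j c^k, the determinant of the matrix with columns a, b, c."""
    return sum(x * y for x, y in zip(cross_via_eps(a, b), c))


if __name__ == "__main__":
    eps = levi_civita()
    zeros = sum(1 for i in range(3) for j in range(3) for k in range(3)
                if eps[i][j][k] == 0)
    print(zeros)                                # number of vanishing entries
    print(cross_via_eps([1, 2, 3], [4, 5, 6]))  # cross product of two points
```

The count of vanishing entries reproduces the 21 quoted in the text, leaving three entries equal to +1 and three equal to -1.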
3 Quadrifocal Tensor Bottom-Up

Consider 4 views with the following 3 x 4 projection matrices: $[I; 0]$, $[A; v']$, $[B; v'']$,
$[C; v''']$ associated with views 1,2,3,4 respectively. By definition, the matrices
A, B, C are homography matrices from view 1 onto 2,3,4 respectively through
some reference plane $\pi$. Let P be some point in 3D projective space projecting
onto $p, p', p'', p'''$ in the four images, i.e., $p = [I; 0]P$, $p' = [A; v']P$, $p'' = [B; v'']P$
and $p''' = [C; v''']P$. Let L be some 3D line passing through P and let the
projections of L onto views 3,4 be denoted by r, t, thus L is the intersection of
the two planes $r^\top[B; v'']$ and $t^\top[C; v''']$. See Fig. 2 as a reference from now on.




Fig. 2. The construction of an LLC mapping between views 1,2 where the kernel line
L is determined by the lines r, t in views 3,4. In the construction one needs to express
the projection of L in view 1 and consider the projection of the intersection point of L
and the reference plane $\pi$ on view 1, the point $B^\top r \times C^\top t$.



We wish to construct an LLC mapping, a 3 x 3 matrix Q(r, t), between views
1,2 whose kernel is the line L. Before we do so, it is worthwhile noting what we
would gain from it. From the definition of an LLC mapping, we have:

$$s^\top Q(r, t) q = 0$$

for all lines s passing through p' and all lines q passing through p. In other
words, $Q(r, t) = r_k t_l Q^{ijkl}$ is a 3 x 3 double contraction of the quadrifocal tensor
(linear combination of 3 x 3 slices). Thus, our mission would be almost completed
if we derive Q(r, t); almost, because one must separate the contribution of
q, s, r, t from the contribution of the camera matrices in order to get a form
$q_i s_j r_k t_l Q^{ijkl} = 0$. To conclude, we have two steps left, one is to derive Q(r, t) as
the LLC between views 1,2 whose kernel is L, and the second is to separate the
image measurements and the camera matrices from the equation $s^\top Q(r, t) q = 0$
(this is where the power of tensor notations becomes critical).
Let $\alpha r^\top[B; v''] + \beta t^\top[C; v''']$ be the pencil of planes whose axis is L, parameterized
by the scalars $\alpha, \beta$. Let $\lambda$ be the projection of L onto view 1, thus the
plane $\lambda^\top[I; 0]$ belongs to the pencil:

$$\begin{pmatrix} \lambda \\ 0 \end{pmatrix} = \alpha \begin{pmatrix} B^\top r \\ v''^\top r \end{pmatrix} + \beta \begin{pmatrix} C^\top t \\ v'''^\top t \end{pmatrix}$$

The last coordinate vanishes for $\alpha = v'''^\top t$ and $\beta = -v''^\top r$.
Therefore, $\lambda = (v'''^\top t)B^\top r - (v''^\top r)C^\top t$. Let $\sigma$ be any plane through L, then
the homography matrix from view 1 to 2 through $\sigma$ is $A_\sigma = A + v'n^\top$, where
n is the projection of the intersection line between planes $\pi, \sigma$ onto view 1. By
definition of LLC mapping we have:

$$Q(r, t) = A_\sigma[\lambda]_\times = A[\lambda]_\times + v'(n \times \lambda)^\top.$$

Note that $n \times \lambda$ is the projection of the intersection point between L and $\pi$ onto
view 1, and furthermore,

$$n \times \lambda = B^\top r \times C^\top t$$

because $C^\top t$ is the projection of the intersection of the planes $\pi$ and $t^\top[C; v''']$
in view 1, and $B^\top r$ is the projection of the intersection of the planes $\pi$ and
$r^\top[B; v'']$ in view 1. The intersection of the two lines in view 1 is the projection
of the intersection point between L and $\pi$ onto view 1. Taken together, we have
an explicit equation for Q(r, t):

$$Q(r, t) = v'(B^\top r \times C^\top t)^\top - (v''^\top r)A[C^\top t]_\times + (v'''^\top t)A[B^\top r]_\times \qquad (1)$$

And the quadlinearity $s^\top Q(r, t) q = 0$ for all lines q, s, r, t in views 1,2,3,4 respectively
that coincide with their respective image points $p, p', p'', p'''$ is:

$$(s^\top v')(B^\top r \times C^\top t)^\top q - (v''^\top r)s^\top A[C^\top t]_\times q + (v'''^\top t)s^\top A[B^\top r]_\times q = 0. \qquad (2)$$
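
The quadlinearity can be tested on synthetic data. The sketch below is an illustration under stated assumptions, not the authors' code: it evaluates the three-term expression with the first term written as $(s^\top v')\, q^\top (B^\top r \times C^\top t)$, the sign that agrees with expanding the determinant of the four planes by its fourth column. The cameras and the 3D point in the usage section are arbitrary integers, so the check is exact.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def dot(a, b):
    """Scalar product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def mat_t_vec(m, v):
    """M^T v for a 3 x 3 matrix stored as a list of rows."""
    return [sum(m[i][j] * v[i] for i in range(3)) for j in range(3)]


def project(m, e, x, w):
    """Image of the 3D point P = (x, w) under the camera [M; e]."""
    return [dot(m[i], x) + e[i] * w for i in range(3)]


def quadlinear_form(A, v1, B, v2, C, v3, q, s, r, t):
    """Left-hand side of the quadlinearity for lines q, s, r, t in views 1..4."""
    a, b, c = mat_t_vec(A, s), mat_t_vec(B, r), mat_t_vec(C, t)
    return (dot(s, v1) * dot(q, cross(b, c))      # (s.v') q.(B^T r x C^T t)
            - dot(v2, r) * dot(a, cross(c, q))    # (v''.r) s^T A [C^T t]_x q
            + dot(v3, t) * dot(a, cross(b, q)))   # (v'''.t) s^T A [B^T r]_x q


if __name__ == "__main__":
    A, v1 = [[1, 2, 0], [0, 1, 1], [1, 0, 1]], [1, 0, 2]
    B, v2 = [[2, 0, 1], [1, 1, 0], [0, 1, 1]], [0, 1, 1]
    C, v3 = [[1, 1, 1], [0, 2, 1], [1, 0, 3]], [2, 1, 0]
    x, w = [1, 2, 3], 1
    p, p1, p2, p3 = x, project(A, v1, x, w), project(B, v2, x, w), project(C, v3, x, w)
    # Any line through an image point can be written as point x (other point).
    q, s = cross(p, [1, 0, 0]), cross(p1, [0, 1, 0])
    r, t = cross(p2, [0, 0, 1]), cross(p3, [1, 1, 0])
    print(quadlinear_form(A, v1, B, v2, C, v3, q, s, r, t))
```

For lines through the four matching image points the expression vanishes, while replacing one of them by a line that misses its point generally gives a non-zero value.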



We have so far observed that the LLC mapping is a basic building block in
constructing the quadlinearity above. It is worthwhile noting that the quadlinearity
above can be also derived “top-down” by a determinant expansion, as
follows. Since the planes $q^\top[I; 0]$, $s^\top[A; v']$, $r^\top[B; v'']$, $t^\top[C; v''']$ meet at the point
P, the determinant below must vanish:

$$\begin{vmatrix} q^\top & 0 \\ s^\top A & s^\top v' \\ r^\top B & r^\top v'' \\ t^\top C & t^\top v''' \end{vmatrix} = 0$$

After expanding the determinant by its fourth column we obtain Eqn. 2
above. In order to continue, we introduce another basic block, the “homography
tensor” of three views. Referring to Fig. 3, consider the line of intersection L
between the plane $\pi$ and the plane $r^\top[B; v'']$. Consider some point P on L and
its projections $p, p', p''$ onto views 1,2,3 respectively. Since the projection of L
onto view 1 is $B^\top r$, the LLC mapping between views 1,2 with L as the
kernel is $A[B^\top r]_\times$. By definition of LLC mapping, we have $s^\top A[B^\top r]_\times q = 0$ for
all lines q, s that are coincident with p, p' respectively.

Fig. 3. The construction of the Homography Tensor. A line r in the third view determines
by intersection of the viewing plane and the plane $\pi$ a line L in 3D forming a kernel
of an LLC mapping $A[B^\top r]_\times$ between views 1,2. In tensor form we have $q_i s_j r_k H^{ijk} = 0$
where q, s, r are lines through a matching triplet of points corresponding to points on
the plane $\pi$. Thus, $H^{ijk}$ is a 3-view analogue to the homography matrix. See [12] for
more details.

From this point on we will move to tensor notations, a necessary step in order
to separate the image measurements q, s, r, t from the camera projection matrices
$A, B, C, v', v'', v'''$. We adopt the notation that indices i, j, k, l are associated
exclusively with views 1,2,3,4. For example, since v' is a point (epipole) in view
2, then when placed in a tensor equation it will always appear as $v'^j$; likewise s
is a line in view 2, then in a tensor equation it will always appear as $s_j$.
Rewriting $s^\top A[B^\top r]_\times q = 0$ in tensor form we have:

$$q_i s_j r_k (\epsilon^{inu} a_n^j b_u^k) = 0$$

and denote

$$H^{ijk} = \epsilon^{inu} a_n^j b_u^k \qquad (3)$$

The tensor $H^{ijk}$ is a homography (collineation) mapping of the plane $\pi$ associated
with 3 views. For example,

$$q_i r_k H^{ijk} \cong p'^j.$$

Just like a homography matrix it can map directly a point in any view onto
its matching point in any other view (not described here). Its 3 x 3 slices are
LLC maps: we saw that $r_k H^{ijk}$ is the LLC map $A[B^\top r]_\times$ between views 1,2 of
the line L; likewise, $s_j H^{ijk}$ is the LLC map $B[A^\top s]_\times$ between views 1,3 whose
kernel is the intersection of $\pi$ and the plane $s^\top[A; v']$, and $q_i H^{ijk}$ is the LLC
map $B[q]_\times A^\top$ between views 2,3 whose kernel is the intersection of $\pi$ and the
plane $q^\top[I; 0]$. Note also that $r^\top B[q]_\times A^\top s = (B^\top r)^\top(q \times A^\top s) = q^\top(A^\top s \times B^\top r)$.

The tensor $H^{ijk}$ is therefore the extension of the 2-view homography matrix and
is referred to as “Homography Tensor” (Htensor in short) — further details are 
beyond the scope of this paper and the reader is referred to m- 
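
A small numerical illustration of the Htensor may help. The sketch below is not from the paper: it assumes the tensor is built as $H^{ijk} = \epsilon^{inu} a_n^j b_u^k$, the tensor form of $s^\top A[B^\top r]_\times q$, with $a_n^j$ stored as `a[j][n]` (row j, column n). It checks that the triple contraction vanishes for lines through a matching triplet p, Ap, Bp of a point on the reference plane. The matrices in the usage section are arbitrary.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def levi_civita(i, j, k):
    """Sign of the index triple over 0, 1, 2: +1, -1, or 0 when an index repeats."""
    return (j - i) * (k - j) * (k - i) // 2


def homography_tensor(a, b):
    """H[i][j][k] = eps^{inu} a_n^j b_u^k for 3 x 3 homographies a, b (lists of rows)."""
    idx = range(3)
    return [[[sum(levi_civita(i, n, u) * a[j][n] * b[k][u]
                  for n in idx for u in idx)
              for k in idx] for j in idx] for i in idx]


def contract3(h, q, s, r):
    """q_i s_j r_k H^{ijk}."""
    idx = range(3)
    return sum(h[i][j][k] * q[i] * s[j] * r[k]
               for i in idx for j in idx for k in idx)


if __name__ == "__main__":
    A = [[1, 2, 0], [0, 1, 1], [1, 0, 1]]
    B = [[2, 0, 1], [1, 1, 0], [0, 1, 1]]
    H = homography_tensor(A, B)
    p = [1, 2, 3]
    p1 = [sum(A[i][j] * p[j] for j in range(3)) for i in range(3)]  # A p
    p2 = [sum(B[i][j] * p[j] for j in range(3)) for i in range(3)]  # B p
    q, s, r = cross(p, [1, 0, 0]), cross(p1, [0, 1, 0]), cross(p2, [0, 0, 1])
    print(contract3(H, q, s, r))
```

The contraction is zero for any choice of lines through the three matching points, which is the sense in which the tensor acts as a 3-view homography of the reference plane.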

Returning to the quadlinearity in eqn. 2 we notice that all three terms consist
of homography tensors of the plane $\pi$ of views (1, 3, 4) and (1, 2, 4) and (1, 2, 3).
Using our notation of indices, we have $H^{ikl}$ for the Htensor of views (1, 3, 4),
$H^{ijl}$ for the Htensor of views (1, 2, 4) and $H^{ijk}$ for the Htensor of views (1, 2, 3).
Therefore, the quadrifocal tensor is:

$$Q^{ijkl} = v'^j H^{ikl} - v''^k H^{ijl} + v'''^l H^{ijk}$$

and the quadlinearity in eqn. 2 is simply

$$q_i s_j r_k t_l Q^{ijkl} = 0.$$
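
The epipole-homography formula can be assembled and tested directly. The following sketch is illustrative and rests on one assumption: each Htensor is built with the same convention, $H^{xyz} = \epsilon^{xnu} m_n^y m'^z_u$, from the pair of homographies of its last two views. With that convention the 81 coefficients satisfy the quadlinearity for lines through matching points. The cameras and the 3D point in the usage section are arbitrary integers.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def eps(i, j, k):
    """Cross-product tensor over indices 0, 1, 2."""
    return (j - i) * (k - j) * (k - i) // 2


def htensor(m1, m2):
    """H[x][y][z] = eps^{xnu} (m1)_n^y (m2)_u^z, matrices stored as lists of rows."""
    idx = range(3)
    return [[[sum(eps(x, n, u) * m1[y][n] * m2[z][u] for n in idx for u in idx)
              for z in idx] for y in idx] for x in idx]


def quadrifocal_tensor(A, v1, B, v2, C, v3):
    """Q[i][j][k][l] = v'^j H^{ikl} - v''^k H^{ijl} + v'''^l H^{ijk}."""
    idx = range(3)
    h134, h124, h123 = htensor(B, C), htensor(A, C), htensor(A, B)
    return [[[[v1[j] * h134[i][k][l] - v2[k] * h124[i][j][l] + v3[l] * h123[i][j][k]
               for l in idx] for k in idx] for j in idx] for i in idx]


def contract4(Q, q, s, r, t):
    """q_i s_j r_k t_l Q^{ijkl}."""
    idx = range(3)
    return sum(Q[i][j][k][l] * q[i] * s[j] * r[k] * t[l]
               for i in idx for j in idx for k in idx for l in idx)


def project(m, e, x, w):
    """Image of the 3D point P = (x, w) under the camera [M; e]."""
    return [sum(m[i][j] * x[j] for j in range(3)) + e[i] * w for i in range(3)]


if __name__ == "__main__":
    A, v1 = [[1, 2, 0], [0, 1, 1], [1, 0, 1]], [1, 0, 2]
    B, v2 = [[2, 0, 1], [1, 1, 0], [0, 1, 1]], [0, 1, 1]
    C, v3 = [[1, 1, 1], [0, 2, 1], [1, 0, 3]], [2, 1, 0]
    Q = quadrifocal_tensor(A, v1, B, v2, C, v3)
    x, w = [1, 2, 3], 1
    q = cross(x, [1, 0, 0])
    s = cross(project(A, v1, x, w), [0, 1, 0])
    r = cross(project(B, v2, x, w), [0, 0, 1])
    t = cross(project(C, v3, x, w), [1, 1, 0])
    print(contract4(Q, q, s, r, t))
```

Because the tensor depends only on the cameras, the same 81 numbers can be reused for every matching quadruple, which is the separation of measurements from camera parameters that the tensor notation was introduced to achieve.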

Finally, the form of $Q^{ijkl}$ does not depend on the position of the reference plane
$\pi$. Changing the reference plane to $\bar\pi$ results in the new set of camera projection
matrices $[I; 0]$, $[A + v'w^\top; v']$, $[B + v''w^\top; v'']$, $[C + v'''w^\top; v''']$ where w is the projection
onto view 1 of the intersection line between $\pi$ and $\bar\pi$. By substitution in eqn. 2
one notices that the terms with w drop out — details are in the full version of 
this paper 

4 Properties of the Quadrifocal Tensor 

Since every quadruple of lines q, s, r, t coincident with the matching points
$p, p', p'', p'''$, respectively, contributes one equation $q_i s_j r_k t_l Q^{ijkl} = 0$ we have a total
of 16 linearly independent equations per matching quadruple of points. Hartley 
^ first noticed and proved that two quadruples contribute only 31 linearly inde- 
pendent equation, and every additional quadruple contributes one less equation, 
thus 6 matching quadruples contribute 16 + 15 + 14 + 13 + 12 + 11 = 81 linearly
independent equations for the 81 coefficients of the tensor. One can obtain a 
simpler geometric proof of why this is so: 

Each quadlinearity is spanned by a set of 16 quadlinearities 

$q_i^\rho s_j^\mu r_k^\nu t_l^\delta Q^{ijkl} = 0, \qquad \rho, \mu, \nu, \delta = 1, 2$ 

where $q^1, q^2$ are two lines, say the horizontal and vertical scan lines, 
passing through $p$, etc. Given any lines $q, s, r, t$ passing through the 
matching points, each line is linearly spanned by the horizontal and vertical 
scan lines, and this linear combination carries through to a linear combination 
of the 16 quadlinearities above. Given a second matching quadruple 
$\bar{p}, \bar{p}', \bar{p}'', \bar{p}'''$, the quadlinearity resulting from 
taking the lines $q = p \times \bar{p}$, $s = p' \times \bar{p}'$, 
$r = p'' \times \bar{p}''$ and $t = p''' \times \bar{p}'''$ is spanned by the 16 
quadlinearities of the first matching quadruple $p, p', p'', p'''$ (see Fig. 4). 
Thus, the two subspaces have a single non-trivial intersection and the total 
rank is 31 (instead of 32). Likewise, the n'th additional quadruple of matching 
points has n - 1 quadlinearities spanned by the previous n - 1 subspaces. 
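
The counting argument above can be checked with a few lines of Python. This is 
only an illustrative sketch: the function names are ours, not the paper's, and 
the second helper merely shows that any line through an image point is a 
combination of the vertical and horizontal scan lines through that point. 

```python
def independent_equations(num_quadruples):
    """Linearly independent quadlinear equations from n matching quadruples."""
    # The first quadruple gives 16; each further one gives one fewer.
    return sum(16 - n for n in range(num_quadruples))


def combine_scan_lines(a, b, point):
    """Line a*vertical + b*horizontal through point (x, y), as (l1, l2, l3)."""
    x, y = point
    vertical = (1.0, 0.0, -x)     # the line X = x
    horizontal = (0.0, 1.0, -y)   # the line Y = y
    return tuple(a * v + b * h for v, h in zip(vertical, horizontal))


print([independent_equations(n) for n in range(1, 7)])
line = combine_scan_lines(2.0, -1.0, (3.0, 4.0))
# The incidence l . (x, y, 1) vanishes for every choice of a and b.
print(line, line[0] * 3.0 + line[1] * 4.0 + line[2])
```

Six quadruples are the smallest number for which the count reaches the 81 
coefficients of the tensor. 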



Fig. 4. A quadruple of matching points provides 16 constraints. A second quadruple 
provides only 15 additional constraints because the constraint defined by the lines 
connecting the two sets of points is already covered by the 16 constraints of the 
first quadruple. 



4.1 Slicing Breakdown 

We move our attention to the breakdown of slices of the tensor. From the con- 
struction of the tensor and the symmetrical role that all views play (all indices 
are contravariant, unlike the trifocal tensor which has both covariant and con- 
travariant indices) we conclude: 

Theorem 1. Let $x, y$, $x \neq y$, be a pair of indices from the set 
$\{i,j,k,l\}$, let $M_x, M_y$ be the camera projection matrices associated with 
the choice of view that the indices $x, y$ represent, and let $\delta, \mu$ 
range over the standard basis (1, 0, 0), (0, 1, 0), (0, 0, 1). Every $3 \times 3$ 
slice $\delta_x \mu_y Q^{ijkl}$ of the quadrifocal tensor corresponds to an LLC 
mapping between the remaining views not represented by $x, y$, whose kernel is 
the line intersection of the planes $\delta^\top M_x$ and $\mu^\top M_y$. For 
example, if $x = i, y = j$ then $\delta_i \mu_j Q^{ijkl}$ provide 9 slices; each 
slice is an LLC map between views 3, 4 whose kernel is the intersection of the 
planes $\delta^\top [I; 0]$ and $\mu^\top [A; v']$. 

Note that in particular a $3 \times 3$ slice is a rank-2 matrix (as observed 
previously), but not every rank-2 matrix is an LLC mapping. Note also that any 
linear combination of the 9 slices $\delta_x \mu_y Q^{ijkl}$, for a fixed choice 
of $x, y$, is also an LLC mapping. Thus the finding above is a stronger 
constraint on the structure of the $3 \times 3$ slices of the tensor than what 
was known so far. 

Next we state (without proof here) that every $3 \times 3 \times 3$ slice of the 
quadrifocal tensor corresponds to a homography tensor: 

Theorem 2. Let $x$ be an index from the set $\{i,j,k,l\}$, let $M_x$ be the 
camera projection matrix associated with the choice of view that the index $x$ 
represents, and let $\delta$ range over the standard basis (1, 0, 0), (0, 1, 0), 
(0, 0, 1). Then every $3 \times 3 \times 3$ slice $\delta_x Q^{ijkl}$ is a 
homography tensor between the three remaining views of the plane 
$\delta^\top M_x$. 



4.2 Quadrifocal Constructed from Trifocal and Bifocal 

We move our attention to the construction of the quadrifocal tensor from lower 
order tensors: the trifocal tensor of 3 views and the fundamental matrix of two 
views. Using symbolic algebra on random camera configurations, Faugeras & 
Mourrain have concluded that the quadlinear forms are spanned by trilinear 
and bilinear ones. We will use the LLC building block to derive a closed-form 
formula representing the quadrifocal tensor as a function of trifocal tensors 
and fundamental matrices. Let $Y(\delta, \mu) = \delta_i \mu_j Q^{ijkl}$ be the 
LLC slice of $Q$ between views 3, 4. Then, 

$Y(\delta, \mu) = [\delta_i \mu_j T_l^{ij}]_\times F_{34} [\delta_i \mu_j \mathcal{T}_k^{ij}]_\times$ 

where $T_l^{ij}$ is the trifocal tensor of views (4, 1, 2), $\mathcal{T}_k^{ij}$ 
is the trifocal tensor of views (3, 1, 2), and $F_{34}$ is the fundamental matrix 
between views 3, 4. 

To see why this expression holds, let $L$ be the line intersection of the planes 
$\delta^\top [I; 0]$ and $\mu^\top [A; v']$. Note that $\delta_i \mu_j T_l^{ij}$ 
is the projection of $L$ onto view 4, and $\delta_i \mu_j \mathcal{T}_k^{ij}$ is 
the projection of $L$ onto view 3. The fundamental matrix flanked on both sides 
by the skew-symmetric matrices of the projections of the kernel line is the LLC 
map between views 3, 4. 
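
As a numerical illustration of this construction, the sketch below builds a 
matrix of the form "fundamental matrix flanked by two skew-symmetric matrices" 
and checks that the two lines are its null vectors and that its rank is at most 
2. The values are random stand-ins (assumptions for the demonstration), not 
quantities derived from real cameras. 

```python
import numpy as np


def skew(v):
    """Return [v]_x, the matrix with skew(v) @ u == np.cross(v, u)."""
    x, y, z = v
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])


rng = np.random.default_rng(0)
# Hypothetical stand-ins: a rank-2 matrix playing the role of F_34 and the
# two projections of the kernel line L (random values, not real data).
F34 = skew(rng.standard_normal(3)) @ rng.standard_normal((3, 3))
line_view4 = rng.standard_normal(3)
line_view3 = rng.standard_normal(3)

Y = skew(line_view4) @ F34 @ skew(line_view3)

print(np.allclose(Y @ line_view3, 0.0))   # right null vector
print(np.allclose(line_view4 @ Y, 0.0))   # left null vector
print(np.linalg.matrix_rank(Y))           # at most 2
```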

By varying $\delta, \mu$ to range over (1, 0, 0), (0, 1, 0), (0, 0, 1) we obtain 
a closed-form formula of the nine $3 \times 3$ slices 

$Q^{11kl}, Q^{12kl}, \ldots, Q^{33kl}$ 

denoted by $Y(1,1), \ldots, Y(3,3)$, making up the quadrifocal tensor. However, 
each slice is determined only up to scale. The 9 scale factors 
$\lambda_1, \ldots, \lambda_9$ can be recovered (up to a global scale) by 
setting $\delta, \mu$ each to be (1, 1, 1), in which case we have: 

$Y(\delta, \mu) = \lambda_1 Y(1,1) + \cdots + \lambda_9 Y(3,3)$ 

which provides a linear system for recovering the scale factors. In conclusion we 
have: 

Theorem 3. The quadrifocal tensor can be constructed from two trifocal tensors 
and one fundamental matrix. 
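
The scale-recovery step can be sketched as a null-space problem. In the minimal 
example below, random $3 \times 3$ matrices stand in for the nine slices (an 
assumption made purely for illustration; real slices would be LLC maps), each 
slice is given up to an unknown factor, and the combined slice is given up to a 
global factor. The name `recover_scales` is ours. 

```python
import numpy as np


def recover_scales(slices, combined):
    """Solve sum_m lam_m * slices[m] = kappa * combined; return lam / kappa."""
    columns = [s.ravel() for s in slices] + [-combined.ravel()]
    system = np.column_stack(columns)          # 9 equations, 10 unknowns
    _, _, vt = np.linalg.svd(system)
    null = vt[-1]                              # spans the 1-D null space
    return null[:-1] / null[-1]


rng = np.random.default_rng(1)
true_slices = [rng.standard_normal((3, 3)) for _ in range(9)]
unknown = rng.uniform(0.5, 2.0, size=9)        # per-slice scale to recover
given = [s / c for s, c in zip(true_slices, unknown)]
combined = 0.7 * sum(true_slices)              # known only up to scale

lam = recover_scales(given, combined)
print(np.allclose(lam / lam[0], unknown / unknown[0]))
```

Only the ratios of the recovered factors are meaningful, which matches the 
"up to a global scale" statement in the text. 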

4.3 Fundamental Matrix from Quadrifocal Tensor 

We next move our attention to the construction of lower order tensors from the 
quadrifocal tensor. We will start with the fundamental matrix. Let 
$Y(\delta, \mu) = \delta_i \mu_j Q^{ijkl}$ be the LLC slice of $Q$ between views 
3, 4 with kernel line $L$. Let $l'', l'''$ be the projections of $L$ onto views 
3, 4 respectively, thus $Y l'' = 0$ and $Y^\top l''' = 0$. Let $F_{34}$ be the 
fundamental matrix between views 3, 4. Let $r$ be some line in view 3; then 
$Y r$ is a point coincident with $F_{34} [l'']_\times r$ (see Fig. 5). Thus, 
$r^\top Y^\top F_{34} [l'']_\times r = 0$ for all $r$, therefore 
$Y^\top F_{34} [l'']_\times$ is a skew-symmetric matrix. The skew-symmetric 
constraint provides 2 linear equations for $F_{34}$. By varying $\delta, \mu$ 
over the standard basis we obtain nine LLC slices, each of which provides 2 
constraints on $F_{34}$. 




Fig. 5. Constructing the fundamental matrix $F_{34}$ between views 3, 4 from the 
quadrifocal tensor. We use the $3 \times 3$ slices $\delta_i \mu_j Q^{ijkl}$ 
which form an LLC mapping $Y$ between views 3, 4 with a kernel line $L$ (whose 
projections in views 1, 2 are $\delta, \mu$). The left and right null spaces of 
$Y$ are the projections of $L$ on views 3, 4. If $r$ is some line in view 3, 
then $Y r$ is a point coincident with $F_{34} [l'']_\times r$, thus 
$Y^\top F_{34} [l'']_\times$ is a skew-symmetric matrix providing 2 constraints 
on $F_{34}$. 



4.4 Trifocal from Quadrifocal 

To recover the trifocal tensor, say of views (1, 2, 3), consider two lines 
$t, \bar{t}$ in view 4. Then $H^{ijk} = t_l Q^{ijkl}$ and 
$\bar{H}^{ijk} = \bar{t}_l Q^{ijkl}$ are two Htensors of views (1, 2, 3) 
associated with two distinct planes. Then, $s_j r_k H^{ijk}$ is a point in view 
1, and $s_j r_k \bar{H}^{ijk}$ is another point in view 1, but these two points 
lie on the line $s_j r_k T_i^{jk}$. Thus, we have the constraint: 

$s_j r_k T_i^{jk} \cong (s_j r_k H^{ijk}) \times (s_j r_k \bar{H}^{ijk})$ 

By varying $s, r$ to range over the standard basis, we obtain slices (each up to 
scale) of the trifocal tensor: 

$T_i^{11}, \ldots, T_i^{33}$ 

The nine scale factors are recovered in two stages. First, by setting 
$s = (1, 1, 1)$ and varying $r$ over the standard basis we obtain a linear 
system for sets of three scale factors. We are thus left with three 
$3 \times 3$ slices, each up to scale. By setting both $s, r$ to (1, 1, 1) we 
obtain a linear system for the three remaining scale factors. 

4.5 Projection Matrices from Quadrifocal Tensor 

Finally, we move our attention to the construction of the camera projection 
matrices from the quadrifocal tensor. Consider again the explicit form 

$Q^{ijkl} = v'^j H^{ikl} - v''^k H^{ijl} + v'''^l H^{ijk}$ 

The epipolar points $v', v'', v'''$ are the null spaces of the fundamental 
matrices which were recovered in Section 4.3. Thus, we have 81 constraints for 
solving for the three Htensors $H^{ikl}$, $H^{ijl}$ and $H^{ijk}$, which 
together form 81 unknowns. 

There are two interesting points to make here. First, all three Htensors corre- 
spond to the same reference plane in space — thus if we extract the constituent 
homography matrices out of the Htensors, then together with the epipoles we 
have an admissible set of camera projection matrices. Second, due to the gauge- 
invariance property of the multi-view tensors we have three degrees of freedom, 
thus the rank of the estimation of the 81 variables of the three Htensors is 78. 
We are free to choose any solution spanned by the null space (the choice will 
determine the gauge, i.e., the location of the reference plane). 

What is left is to show how to extract the homography matrices from view 
1 to 2, 3, 4 from the Htensors. Consider for example the Htensor $H^{ijk}$ of 
views (1, 2, 3) with constituent homography matrices $A, B$. Because 
$3 \times 3$ slices correspond to LLC maps, it is possible to extract from them 
the homographies $A, B$. For example, $\delta_k H^{ijk}$ produces an LLC map 
$A [B^\top \delta]_\times$; by allowing $\delta$ to range over the standard 
basis (1, 0, 0), (0, 1, 0), (0, 0, 1), we obtain three such matrices, denoted by 
$E_1, E_2, E_3$. We have that $A E_i^\top + E_i A^\top = 0$, $i = 1, 2, 3$, 
thus providing 18 linear equations for $A$. Similarly, one can find in this 
manner the other homographies $B, C$, each up to scale. The three scale factors 
can be determined by using the explicit form of $Q^{ijkl}$ again, this time 
with the Htensors constructed from the homography matrices, where the unknowns 
are the three scale factors. 
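
The linear recovery of a homography from three LLC slices can be sketched 
numerically. The example below assumes that each slice has the form 
$E_i = A [B^\top e_i]_\times$ for the standard basis vectors $e_i$; this form, 
the function name `homography_from_llc`, and the random test matrices are 
assumptions made for illustration. Under that assumption $E_i A^\top$ is 
skew-symmetric, and $A$ is the null vector of the stacked linear system. 

```python
import numpy as np


def skew(v):
    """Return [v]_x, the matrix with skew(v) @ u == np.cross(v, u)."""
    x, y, z = v
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])


def homography_from_llc(E_list):
    """Null-space solution of A E_i^T + E_i A^T = 0, returned up to scale."""
    blocks = []
    for E in E_list:
        cols = []
        for idx in range(9):
            basis = np.zeros(9)
            basis[idx] = 1.0
            A_basis = basis.reshape(3, 3)
            cols.append((A_basis @ E.T + E @ A_basis.T).ravel())
        blocks.append(np.column_stack(cols))
    system = np.vstack(blocks)                 # 27 x 9, linear in vec(A)
    _, _, vt = np.linalg.svd(system)
    return vt[-1].reshape(3, 3)


rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
# Assumed slice form E_i = A [B^T e_i]_x for the standard basis e_i.
E_list = [A @ skew(B.T @ e) for e in np.eye(3)]

A_est = homography_from_llc(E_list)
cosine = abs(np.sum(A_est * A)) / (np.linalg.norm(A_est) * np.linalg.norm(A))
print(round(float(cosine), 6))
```

Each symmetric $3 \times 3$ equation contributes 6 distinct scalar equations, 
so the 27 stacked rows contain the 18 linear equations mentioned in the text. 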

4.6 The 51 Non-linear Constraints 

The quadrifocal tensor is represented by 29 parameters (44 - 15 = 29), thus we 
expect 51 non-linear constraints (“admissibility” constraints) on the 81 
coefficients (up to scale) of the tensor. This issue has so far been unresolved. 
Heyden conjectures that the source of these constraints comes from the rank-2 
property of the $3 \times 3$ slices. But as we saw in Theorem 1 the matter is 
more complicated because the slices are LLC maps, and not every rank-2 matrix is 
an LLC map. We will use the $3 \times 3 \times 3$ slices, the Htensors, to build 
up those 51 constraints, as described below. 

Consider the three $3 \times 3 \times 3$ slices $\delta_l Q^{ijkl}$ obtained by 
letting $\delta$ range over the standard basis (1, 0, 0), (0, 1, 0), (0, 0, 1). 
From Theorem 2 we know that these slices are Htensors, denoted by $H_{(m)}$, 
$m = 1, 2, 3$. Note that all three Htensors are of views (1, 2, 3), each 
corresponding to a different plane. Let $A_{(m)}, B_{(m)}$ be the constituent 
homography matrices of the m'th Htensor. We know that 
$A_{(1)}, A_{(2)}, A_{(3)}$ are homography matrices from view 1 to 2 
corresponding to three different planes $\pi_1, \pi_2, \pi_3$, and 
$B_{(1)}, B_{(2)}, B_{(3)}$ are homography matrices from view 1 to 3 
corresponding to the same planes $\pi_1, \pi_2, \pi_3$, respectively. 

We will divide the 51 constraints into two sets. The first set consists of 
9 x 3 = 27 constraints which describe the constituent homography matrices 
$A_{(m)}, B_{(m)}$ from their corresponding Htensors. The second set consists of 
24 constraints which embody the relationship between $A_{(m)}, B_{(m)}$ as 
described above. 

Recall that an Htensor produces 18 linear constraints for each of its 
homography matrices. Consider the three $3 \times 3$ slices 
$\delta_k H_{(1)}^{ijk}$ of the Htensor $H_{(1)}$ and denote the resulting 
matrices by $E_1, E_2, E_3$. We know that 
$A_{(1)} E_i^\top + E_i A_{(1)}^\top = 0$, $i = 1, 2, 3$, which provide 18 
constraints for $A_{(1)}$. Choose 8 constraints from these 18 constraints. Thus, 
each entry of the matrix $A_{(1)}$ can be represented by a determinant expansion 
of an $8 \times 8$ matrix whose components come from the tensor elements that 
participate in those 8 constraints. The remaining 10 constraints must be of rank 
8, thus by substituting $A_{(1)}$ in the remaining 10 constraints we have 8 
polynomials of degree 9 on the entries of $E_1, E_2, E_3$ that must be 
satisfied in order that $A_{(1)} E_i^\top + E_i A_{(1)}^\top = 0$ for 
$i = 1, 2, 3$. In this way we may solve for $B_{(1)}$, but this does not add new 
constraints because we are using the same information used to derive $A_{(1)}$. 
The scale of the Htensor is set because it is a slice of the quadrifocal tensor, 
yet the scales of $A_{(1)}, B_{(1)}$ are arbitrary. Therefore, there is another 
constraint that is captured as follows. Let $\alpha$ be some unknown scale of 
$A_{(1)}$, such that $H_{(1)} = \alpha A_{(1)} \otimes B_{(1)}$, where $\otimes$ 
is a short-cut denoting the Htensor equation. This provides 27 equations. Choose 
any two of them and eliminate $\alpha$; the result is a non-linear equation in 
the elements of $H_{(1)}$, $A_{(1)}$ and $B_{(1)}$. Taken together, we have 9 
non-linear constraints from $H_{(1)}$, and since this is true for $m = 2, 3$ as 
well, we have 27 non-linear constraints. 

We have 27 constraints, and have $A_{(m)}, B_{(m)}$ represented by determinant 
expansions of the entries of the quadrifocal tensor. Because $A_{(m)}$ are 
between view 1 to 2, and $B_{(m)}$ are between view 1 to 3, and each pair 
$A_{(m)}, B_{(m)}$ is associated with the same plane $\pi_m$, $m = 1, 2, 3$ 
respectively, we have: 

$A_{(2)} = \lambda_1 A_{(1)} + v' w^\top$ 
$\lambda_2 B_{(2)} = \lambda_3 B_{(1)} + v'' w^\top$ 
$A_{(3)} = \lambda_4 A_{(1)} + v' \bar{w}^\top$ 
$\lambda_5 B_{(3)} = \lambda_6 B_{(1)} + v'' \bar{w}^\top$ 

where $w, \bar{w}$ are the projections onto view 1 of the line intersection 
between $\pi_1, \pi_2$ and between $\pi_1, \pi_3$ respectively. We have 36 
equations in $A_{(m)}, B_{(m)}$ (which are represented in terms of determinant 
expansions of the quadrifocal tensor elements), the epipoles $v', v''$ (which 
are also represented as non-linear functions of the quadrifocal elements, as we 
saw in Section 4.3) and 12 variables: 6 from $w, \bar{w}$ and the scales 
$\lambda_1, \ldots, \lambda_6$. By elimination of the 12 variables we are left 
with 36 - 12 = 24 polynomials on the elements of the quadrifocal tensor alone. 
Taken together we have 27 + 24 = 51 constraints (plus the constraint arising 
from the global scale factor). The two sets of constraints are algebraically 
independent as they use different information: the first set arises from the 
rank-2 LLC map of the $3 \times 3$ slice of the Htensors which provides 
constraints on the homography matrices, and the second set arises from the 
relationships between the individual homography matrices. 
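
The bookkeeping behind the number 51 can be written out explicitly. The short 
sketch below only reproduces the arithmetic stated in this section; the 
variable names are ours, and the reading of 44 as four cameras with 11 
parameters each is the usual interpretation rather than something spelled out 
above. 

```python
# Four 3x4 cameras (11 parameters each) minus a 15-parameter projective frame.
camera_parameters = 4 * 11 - 15
# 81 tensor coefficients, defined up to one overall scale.
tensor_dof = 3 ** 4 - 1
admissibility = tensor_dof - camera_parameters

first_set = 9 * 3        # nine constraints from each of the three Htensors
second_set = 36 - 12     # 36 equations after eliminating 12 variables

print(camera_parameters, admissibility, first_set + second_set)
```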

5 Summary 

We have derived an explicit form of the quadrifocal tensor analogous to the 
forms of the lower order tensors of multi-view geometry, the trifocal tensor 
and the fundamental matrix. The lower-order tensors have explicit forms as 
epipole-homography outer-products. The fundamental matrix is formed by a 
single epipole-homography coupling $F = [v']_\times A$, and the trifocal tensor 
is formed by two pairs of epipole-homography: 
$T_i^{jk} = v'^j b_i^k - v''^k a_i^j$. We have shown that the quadrifocal tensor 
is formed by three pairs of epipole-Htensor couplings: 

$Q^{ijkl} = v'^j H^{ikl} - v''^k H^{ijl} + v'''^l H^{ijk}$ 

where the Htensors are 3-view analogues of the 2-view homography matrices (they 
perform the operation of collineations between views of a 2D configuration). 

Using the explicit form of the tensor and the tool of LLC mapping for analysis, 
the slicing breakdown is relatively simple: the 3x3 slices of the quadrifocal 
tensor form LLC mappings; in particular these are rank-2 matrices, but not 
every rank-2 matrix is an LLC mapping. The 3x3x3 slices are homography 
tensors (Htensors), i.e., collineations of 3-view sets, which is analogous to 
the covariant 3x3 slices of the trifocal tensor which form homography matrices. 
Moreover, the construction of the quadrifocal tensor from the lower-order tensors 
and, vice-versa, the construction of the lower-order tensors from the quadrifocal 
tensor as well as the camera projection matrices were presented in detail; 
again, these become relatively straightforward once the explicit form is 
available. 

Finally, the explicit form and the discovery that the 3x3x3 slices are 
homography tensors has provided a simple route to deriving the 51 non-linear 
constraints that all quadrifocal tensors must adhere to. 



Abstract. We tackle the problem of 3D surface reconstruction by a 
single static camera, extracting the maximum amount of information 
from gray level changes caused by object motion under illumination by 
a fixed set of light sources. We basically search for the depth at each 
point on the surface of the object while exploiting the recently proposed 
Geotensity constraint that accurately governs the relationship between 
four or more images of a moving object in spite of the illumination 
variance due to object motion. The thrust of this paper is then to extend 
the availability of the Geotensity constraint to the case of multiple 
point light sources instead of a single light source. We first show that it 
is mathematically possible to identify multiple illumination subspaces for 
an arbitrary unknown number of light sources. We then propose a new 
technique to effectively carry out the separation of the subspaces by 
introducing the surface interaction matrix. Finally, we construct a framework 
for surface recovery, taking the multiple illumination subspaces into 
account. The theoretical propositions are investigated through experiments 
and shown to be practically useful. 



1 Introduction 

3D surface reconstruction of an object has been among the subjects of major 
interest in computer vision. Given a set of images, in each of which the object 
is viewed from a different direction, the fundamental issue in extracting 3D 
information out of 2D images is to match corresponding points in those images 
so that these points are the projections of an identical point on the surface of 
the object. For the point correspondence, the constraint typically exploited is 
that the corresponding parts of the images have equivalent image intensities, 
regarding the variation in illumination as noise. It has been successfully 
applied to stereo, where two images are taken simultaneously, as the lighting of 
the object is then identical in each image. However, when, as a natural 
progression, we consider replacing the stereo camera with a single camera 
observing an object in motion, unfortunately the constraint is nearly always 
invalid, as non-uniform lighting causes the intensity at a specific location on 
the surface of an object to change as the object moves. Among the few efforts on 
this issue, whereas photometric motion treated the illumination variance due to 
object motion in terms of optical flow, recently the Geotensity constraint has 
been derived to overcome the problem with respect to camera geometry, and to 
replace the constant intensity constraint. Based on the notion of linear 
intensity subspaces, the Geotensity constraint governs the relationship between four 
or more images of a moving object, and it can be computed and applied au- 
tomatically to the task of 3D surface reconstruction. The algorithm for surface 
reconstruction using Geotensity constraint proceeds basically in two stages. The 
first stage is to derive the parameters of the Geotensity constraint by analyzing 
coordinates and image intensities of some sample points on the object in motion. 
That is, computing structure from motion obtains the geometric parameters of 
the situation, whereas computing the linear image subspace obtains the lighting 
parameters of the situation. By combining both sets of parameters we arrive at 
the Geotensity constraint. Using the same set of images, the second stage is to 
take each pixel in an arbitrary reference image in turn and search for the depth 
along the ray from the optical center of the camera passing through the pixel. 
The depth is evaluated by measuring the agreement of the entire set of projected 
intensities of a point on the object surface with the Geotensity constraint. 

Although the availability of the constraint was limited in principle to the 
case of a single point light source, the thrust of this paper is to propose a new 
framework that enables the constraint to be applied to a more general case of 
multiple point light sources. When multiple light sources exist, computing the 
lighting parameters in the first stage is not a simple task as was the case with 
a single light source since most points on the surface are illuminated only by 
a subset of the light sources and this subset is different for different points. 
The question is whether we can identify the lighting parameters for the different 
illumination subspaces arising from different subsets of light sources. Knowing the 
lighting parameters, we further need to choose appropriate lighting parameters 
to search for the depth at each point on the object surface. 

In this paper, we propose a new method to solve for different illumination 
subspaces and then to recover the surface of the object. In order to solve for the 
lighting parameters, in the first stage, we develop a technique to automatically 
sort the sample points into different clusters according to the illuminating set of 
light sources. This has been made possible by introducing a matrix representation 
of a property of the object surface, which we call the surface interaction matrix. The 
entries of this matrix are computable only from the intensities of the sample 
points, and transforming it into the canonical form results in segmenting sample 
points. Once the segmentation is carried out, we simply solve for the lighting 
parameters individually for each cluster. In the second stage, our mechanism to 
search for the depth at each point on the surface of the object is to exploit the 
Geotensity constraint by taking all the possible lighting parameters into account. 
In principle the proposed constraint is that the correct depth with correct 
lighting parameters should best satisfy the Geotensity constraint, where the 
agreement with the constraint is measured in the values of pixel intensities in 
the images themselves. This is important since we employ the Geotensity 
constraint so that the estimation for surface depth will be optimal in the sense 
of finding the least square error with respect to image noise (which is 
generally well understood and well behaved). 

^ Geotensity stands for “geometrically corresponding pixel intensity.” 

Although we will show that our theory is quite general, we will make a 
number of assumptions in order to present some early results. In particular we 
will assume that the object has Lambertian surface properties and convex shape 
(therefore no self-shadowing). Despite such assumptions the key advance is that 
our formulation allows our results to be obtained fully automatically and for a 
wide range of well researched stereo algorithms to be applied directly. 

2 The Geotensity Constraint 

The Geotensity constraint for 3D surface reconstruction was first proposed by 
applying the notion of linear image basis to an object in motion under a 
single point light source. Here we give a brief summary of the constraint while 
introducing an additional scheme to explicitly solve for the illuminant 
direction. 

2.1 Issues Respecting Geometry and Intensity 

We first consider some issues respecting geometry and intensity that form the 
basis of the Geotensity constraint. What we initially need is to find some num- 
ber of corresponding sample points by an independent mechanism. One way to 
robustly sample proper points between frames is to employ a scheme to extract 
corners and correlate them automatically while eliminating outliers. For 
consecutive images sampled at a sufficiently high time frequency we can also 
use the tracking method of Wiles et al. Given point 
correspondence for some sample points, we can derive a constraint on geometry 
by the coordinates, and also a photometric constraint by observing the intensi- 
ties on these points. 

Solving for Geometry 

In this paper, for simplicity, we will concern ourselves with the affine and 
scaled-orthographic camera models for projection. Consider the $i$-th world 
point $\mathbf{X}_i = (X_i, Y_i, Z_i)^\top$ on the surface of an object 
projected to image point $\mathbf{x}_i(j) = (x_i(j), y_i(j))^\top$ in the 
$j$-th frame. The affine camera model defines this projection to be 

$\mathbf{x}_i(j) = M(j) \mathbf{X}_i + t(j)$, 

where $M(j)$, an arbitrary $2 \times 3$ matrix, and $t(j)$, an arbitrary 
2-vector, encode the motion parameters of the object. The solution to the 
structure from motion problem using singular value decomposition is well known 
for this case; given at least four point trajectories $\mathbf{x}_i(j)$ observed 
through at least two frames, the set $M(j)$, $\mathbf{X}_i$ and $t(j)$ can be 
uniquely recovered up to an arbitrary affine ambiguity. The result is affine 
structure. 
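
The SVD-based solution referred to above can be sketched as a rank-3 
factorization of the centred measurement matrix. The code below is a minimal 
illustration on synthetic, noise-free tracks; the function name 
`affine_factorization` and the random test data are assumptions, and the 
recovered motion and structure are expressed in an arbitrary affine frame, in 
line with the ambiguity mentioned in the text. 

```python
import numpy as np


def affine_factorization(tracks):
    """Factor tracks of shape (frames, points, 2) into M, X and t."""
    n_frames, n_points, _ = tracks.shape
    t = tracks.mean(axis=1)                            # per-frame centroid
    centred = tracks - t[:, None, :]
    W = centred.transpose(0, 2, 1).reshape(2 * n_frames, n_points)
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    M = (U[:, :3] * s[:3]).reshape(n_frames, 2, 3)     # motion, affine frame
    X = Vt[:3].T                                       # structure, affine frame
    return M, X, t, s


rng = np.random.default_rng(3)
M_true = rng.standard_normal((3, 2, 3))
t_true = rng.standard_normal((3, 2))
X_true = rng.standard_normal((6, 3))
tracks = np.einsum("fij,nj->fni", M_true, X_true) + t_true[:, None, :]

M, X, t, s = affine_factorization(tracks)
reprojected = np.einsum("fij,nj->fni", M, X) + t[:, None, :]
print(np.allclose(reprojected, tracks), np.round(s, 6))
```

Only the first three singular values are non-zero for exact affine data, which 
is why the rank-3 truncation reproduces the tracks. 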

Given the solution to structure from motion using the affine camera model, 
the Euclidean structure and motion parameters fitting the weak perspective 
camera model can be recovered. A result of choosing the first frame to be 
canonical is that the structure vectors have the form 
$\mathbf{X}_i = (\mathbf{x}_i(1)^\top, Z)^\top$, and we can derive the 
relationship 

$\mathbf{x}_i(j) = M(j) (\mathbf{x}_i(1)^\top, Z)^\top + t(j)$. (1) 

This relationship effectively describes the epipolar constraint between two 
images. 

Solving for Image Intensity 

Assuming a static camera and light source, we consider the intensity $I_i(j)$ 
of the $i$-th point on the surface of a moving object projected into the $j$-th 
image. For a Lambertian surface we can then express $I_i(j)$ in terms of the 
image formation equation so that 

$I_i(j) = \max(b_i^\top s(j), 0)$ (2) 

where the maximum operator zeroes negative components. The 3-vector $b_i$ is 
the product of the albedo with the inward facing unit normal for the point, and 
the 3-vector $s(j)$ is the product of the strength of the light source with the 
unit vector for its direction. The negative components correspond to the 
shadowed surface points and are sometimes called attached shadows [1 4]. To 
have no attached shadows, images must be taken with the light source in the 
bright cell (the cell of light source directions that illuminate all points on 
the object). Note that 

$s(j) = R(j)^\top s(1)$ (3) 

where the $3 \times 3$ matrix $R(j)$ is the rotation of the object from the 
first canonical frame to the $j$-th frame. Multiplication by $R(j)^\top$ 
virtually represents the inverse rotation of the light source. The rotation 
matrix is directly computed from the $2 \times 3$ matrix $M(j)$ that is given 
above by solving the structure from motion problem. 
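
Equation 3 rests on the fact that rotating the surface under a fixed light gives 
the same shading as keeping the surface fixed and rotating the light inversely. 
The few lines below check this identity numerically; the rotation angle, the 
vector `b` and the light `s1` are hypothetical values chosen for illustration. 

```python
import numpy as np

angle = 0.4
R = np.array([[np.cos(angle), -np.sin(angle), 0.0],
              [np.sin(angle), np.cos(angle), 0.0],
              [0.0, 0.0, 1.0]])          # hypothetical object rotation R(j)
b = np.array([0.2, -0.3, 0.9])           # albedo-scaled normal in frame 1
s1 = np.array([0.1, 0.4, 1.2])           # light vector s(1), fixed in the world

rotated_object = (R @ b) @ s1            # rotate the surface, keep the light
rotated_light = b @ (R.T @ s1)           # keep the surface, use s(j) = R^T s(1)
print(np.isclose(rotated_object, rotated_light))
```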

Given the correspondence for a small number of $n_i$ pixels through all of $n_j$ 
images, we record the corresponding pixel intensities, $I_i(j)$, in $I$, which 
we call the illumination matrix. Then, we can form the matrix equation 

$I = BS$ (4) 

where $I$ is an $n_i \times n_j$ matrix containing the elements $I_i(j)$, $B$ is 
an $n_i \times 3$ matrix containing the rows $b_i^\top$, and $S$ is a 
$3 \times n_j$ matrix containing the columns $s(j)$. Equation 4 is then in the 
familiar form for solution by singular value decomposition to obtain a rank 3 
approximation to the matrix $I$ such that 

$I = \hat{B} \hat{S}$. (5) 

In practice we employ RANSAC, a robust random sampling and consensus technique, 
to ensure that artifacts that are caused by an object not fulfilling the 
assumed conditions (e.g. specularities and self-shadowing) do not distort the 
correct solution. 

As is well known, the solution is unique up to an arbitrary invertible 
$3 \times 3$ transformation $A$, and equation 5 is equivalent to 

$I = \hat{B} \hat{S} = (\hat{B} A^{-1})(A \hat{S})$. (6) 

Therefore, $\hat{S}$ by arbitrary decomposition contains the information of the 
light source up to an arbitrary transformation $A$. Although the matrix 
$\hat{S}$ will suffice for surface reconstruction with a single light source, 
the parameterization using the lighting vectors is useful in an environment 
with multiple light sources. 
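
The rank-3 factorization and its $3 \times 3$ ambiguity can be illustrated on 
synthetic data. The sketch below assumes shadow-free intensities (all entries of 
the hypothetical `B` and `S` are positive, so the maximum operator in the image 
formation equation is inactive) and omits the RANSAC step mentioned above. 

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.uniform(0.1, 1.0, size=(8, 3))   # hypothetical rows b_i^T
S = rng.uniform(0.1, 1.0, size=(3, 5))   # hypothetical columns s(j)
I = B @ S                                # all entries positive: no shadows

U, sing, Vt = np.linalg.svd(I, full_matrices=False)
B_hat = U[:, :3] * sing[:3]
S_hat = Vt[:3]

# Solve S = A S_hat for the 3x3 ambiguity A (least squares on the transpose).
A = np.linalg.lstsq(S_hat.T, S.T, rcond=None)[0].T

print(np.allclose(B_hat @ S_hat, I))
print(np.allclose(A @ S_hat, S), np.allclose(B_hat @ np.linalg.inv(A), B))
```

In this synthetic setting the true `S` is available, so `A` can be solved for 
directly; with real images `A` is unknown and has to be estimated from 
additional constraints. 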

Estimating the Light Source Direction 

The problem of estimating the light source direction reduces to that of 
solving for the matrix $A$ which transforms $\hat{S}$ into $S$ by 

$S = A \hat{S}$, (7) 

or each column $\hat{s}(j)$ into $s(j)$ by 

$s(j) = A \hat{s}(j)$ (8) 

where $\hat{s}(j)$ denotes the columns of $\hat{S}$. Substituting equation 8 
into equation 3 we have the relation 

$A \hat{s}(j) = R(j)^\top A \hat{s}(1)$, (9) 

which provides a homogeneous system; three equations for each reference image. 
In order to solve for the nine elements of $A$ as a well-posed problem, it is 
necessary to have reference information so that the resulting light source 
direction fits the first canonical input frame. Observing $I_i(j)$ at an 
arbitrary sample point $i$ for which the light source is in the bright cell 
throughout the input images, and substituting equation 8 into equation 2 for 
this point, we have 

$I_i(j) = b_i^\top A \hat{s}(j)$. (10) 

Since $\hat{s}(j)$ is obtained by the singular value decomposition of matrix 
$I$ and the set of intensities, $I_i(j)$, is also known, equation 10 provides an 
additional constraint on matrix $A$ for each reference image. With the 
constraints of equation 9 and equation 10, matrix $A$ can be solved using a 
minimum of three input images. Once matrix $A$ is solved, by multiplying $A$ 
with $\hat{S}$ we obtain the explicit light source matrix $S$ containing the 
columns $s(j)$. 



Fig. 1. Geotensity constraint. The intensity of world point $\mathbf{X}_i$ 
projected into the first image, $I_i(1)$, is represented by a unique linear 
combination of the intensities of the same point projected in the other three 
images, $I_i(2) \cdots I_i(4)$, for all points $i$. 

2.2 Depth by the Geotensity Constraint 

The term Geotensity constraint accounts for a constraint between four or more 
images of an object from different views under static lighting conditions. This 
concept is schematically depicted in Figure 1 by replacing object motion with a 
coherent motion of the camera and the light source. 

The conditions for applying the Geotensity constraint to depth reconstruction 
are as follows: (i) The scene consists of a single moving object. (ii) The object 
has Lambertian surface properties and is convex (therefore no self-shadowing) 
while the surface may or may not be textured. (iii) There is a single distant 
light source. However, condition (iii) will be relaxed in the following sections. 

Evaluating the Set of Intensities 

At each pixel, $\mathbf{x}$, in the first image, to search for the depth $Z$ we 
can recall equation 1 for the geometric constraint imposed on a sequence of 
images so that 

$I(j; \mathbf{x}, Z) = I[\mathbf{x}(j)] = I[M(j)(\mathbf{x}^\top, Z)^\top + t(j)]$. (11) 

$I(j; \mathbf{x}, Z)$ indicates the set of image intensities in the $j$-th frame 
at the coordinates determined by $\mathbf{x}$ in the first image, the guess of 
depth $Z$, and the motion parameters $M(j)$ and $t(j)$. The task is now to 
evaluate the set of intensities $I(j; \mathbf{x}, Z)$. When full Euclidean 
lighting conditions have been recovered in advance so that $s(j)$ is known, with 
four images we define $\hat{b}$ as 

$\hat{b}^\top = [I(1)\; I(2)\; I(3)\; I(4)]\, S^\top (S S^\top)^{-1}$ (12) 

where $S$ is a $3 \times 4$ matrix containing the columns $s(j)$ 
$(j = 1, 2, 3, 4)$. For a single light source with all images taken with the 
light source in the bright cell, the estimated values of the intensities are 
then 

$\hat{I}(j; \mathbf{x}, Z) = \hat{b}^\top s(j)$. (13) 

It should be noted that exactly the same estimation of 
$\hat{I}(j; \mathbf{x}, Z)$ is available also in the case that the light source 
direction is determined only up to the ambiguity. This is easily confirmed by 
substituting equation 8 into equation 12 and then into equation 13, where 
matrix $A$ turns out to be canceled. Estimating $\hat{I}(j; \mathbf{x}, Z)$ by 
equation 13 we can define the error function to evaluate the set of intensities 
$I(j; \mathbf{x}, Z)$ as 

$E(\mathbf{x}, Z) = \sum_j \left( I(j; \mathbf{x}, Z) - \hat{I}(j; \mathbf{x}, Z) \right)^2$. (14) 
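
The cancellation of the ambiguity matrix can be confirmed numerically. In this 
sketch, `predict_intensities` (our name) evaluates the least-squares estimate of 
equation 12 followed by equation 13; the lighting matrix, the invertible matrix 
`A` and the observed intensities are hypothetical values. 

```python
import numpy as np


def predict_intensities(observed, S):
    """Equations 12 and 13: least-squares b, then the estimates b^T s(j)."""
    b = observed @ S.T @ np.linalg.inv(S @ S.T)
    return b @ S


rng = np.random.default_rng(5)
S_hat = rng.uniform(0.1, 1.0, size=(3, 4))      # lighting known up to A
A = np.array([[2.0, 0.5, 0.0],
              [0.1, 1.5, 0.3],
              [0.0, 0.2, 1.0]])                 # hypothetical invertible ambiguity
observed = rng.uniform(0.0, 1.0, size=4)        # I(1), ..., I(4) at one depth guess

with_ambiguity = predict_intensities(observed, S_hat)
with_true_light = predict_intensities(observed, A @ S_hat)
print(np.allclose(with_ambiguity, with_true_light))
```

The predicted intensities are the projection of the observations onto the row 
space of the lighting matrix, and that row space is unchanged by any invertible 
`A`. 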



Computing the Depth 

At each pixel, x, in the first image we measure the error, E, in the Geotensity 
constraint at regular small intervals of depth, Z. When the depth is correct we 
expect the error to approach zero and when it is incorrect we expect the error 
to be large. The Geotensity constraint can be stated simply as 

E(x,Z) = 0. (15) 

It is clear that as the depth parameter is varied the location of the corresponding 
points in each image will trace out the corresponding epipolar line in each image. 
We then choose such depth Z that minimizes the error E(x, Z) as the depth 
estimate. 
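
The depth search can be sketched as follows. The image sampling along the 
epipolar lines is abstracted away: the array `sampled` is a hypothetical 
stand-in whose row k holds the intensities read from each image at the pixel 
that the k-th depth guess projects to. Five images are used here so that the 
residual has two degrees of freedom; the function names are ours. 

```python
import numpy as np


def geotensity_error(observed, S):
    """Equation 14 for one depth guess."""
    b = observed @ S.T @ np.linalg.inv(S @ S.T)
    residual = observed - b @ S
    return float(residual @ residual)


def search_depth(depths, sampled, S):
    """Return the depth whose sampled intensities best fit the constraint."""
    errors = [geotensity_error(row, S) for row in sampled]
    return depths[int(np.argmin(errors))]


rng = np.random.default_rng(6)
S = rng.uniform(0.1, 1.0, size=(3, 5))
depths = np.linspace(1.0, 2.0, 11)
# Hypothetical samples: sampled[k, j] is the intensity read from image j at
# the pixel that depth guess depths[k] projects to.
sampled = rng.uniform(0.0, 1.0, size=(11, 5))
sampled[7] = np.array([0.3, 0.5, 0.4]) @ S      # consistent only at index 7

print(search_depth(depths, sampled, S))
```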

3 Estimation of Multiple Light Sources 

Our discussion in the previous section has been limited largely to the case of 
a single light source. In this section, we extend this by describing a scheme to 
compute surface depth using the Geotensity constraint for the case of multiple 
light sources. If all the light sources illuminated all the points on the surface of 
the object then we could treat the combined light sources as an equivalent single 
light source and carry out computation of this vector from measurements as 
before. Unfortunately, most (if not all) points on the surface will be illuminated 
by a subset of the light sources and this subset will be different for different 
points. Therefore, the illumination matrix I will contain intensities derived from 
different subsets of light sources. 

The basic idea we propose for the case with multiple light sources is to 
first sort the rows of the illumination matrix I into submatrices, each of which 
contains intensities derived from an identical subset of light sources. We call each 
such submatrix an illumination submatrix. Once the segmentation is carried out, 
we estimate the direction of combined light sources for each subset, applying the 
technique described in Section 2.1 individually to each illumination submatrix. 

3.1 Illumination Submatrix 

Fig. 2. Synthetic sphere (128 x 128). Left: The first point light source is at 
infinity in the same direction as the viewing point. Middle: The second in the 
top-left direction (0.37, -0.92, -0.18). Right: The sphere is illuminated by both 
of the point light sources. 

Although we describe the algorithm mainly for the case of two point light 
sources for simplicity, the proposed algorithm will turn out to be applicable to 
the general case of an arbitrary unknown number of light sources. For 
illustration we use the synthetic sphere shown in Figure 2. The first point 
light source, illuminating the entire semi-sphere, is placed at infinity in the 
same direction as the viewing point, whereas the second is also at infinity but 
in the top-left direction, causing an attached shadow. For convenience we call 
the lighting vectors $l_1$ and $l_2$, respectively, and the combined lighting 
vector $l_0$, i.e., $l_0 = l_1 + l_2$. Thus $l_0$ illuminates the top-left area 
while the rest of the area is illuminated by $l_1$ only. In this example there 
is no area that is illuminated only by $l_2$ and we ignore it without loss of 
generality. 

Suppose that $n_0$ points out of $n_i$ samples stay illuminated by $l_0$ throughout
the image sequence and $n_1$ points only by $l_1$. Given that we somehow know
the classification of the sample points (the method for the classification is described
in Section 3.2), we could permute the rows of the illumination matrix $I$ in such a way
that each illumination submatrix $I_l$ ($l = 0, 1$) contains the $n_l$
sample points due to each lighting vector. Analogous to the case of
a single point light source, each illumination submatrix could then be rewritten
as

$$I_l = B_l S_l \qquad (l = 0, 1) \tag{16}$$

and the illumination matrix can be represented in its canonical form $\bar{I}$:

$$\bar{I} = \begin{pmatrix} I_0 \\ I_1 \end{pmatrix} = \begin{pmatrix} B_0 S_0 \\ B_1 S_1 \end{pmatrix} \tag{17}$$

where the $n_l \times 3$ matrices $B_l$ are submatrices of $B$ containing the rows corresponding
to sample points illuminated by $l_l$. For convenience we use the notations

$$\bar{B} = \begin{pmatrix} B_0 & 0 \\ 0 & B_1 \end{pmatrix}, \qquad \bar{S} = \begin{pmatrix} S_0 \\ S_1 \end{pmatrix}. \tag{18}$$

In the following we discuss the acquisition of the above representation $\bar{I}$, which
essentially requires knowledge of the classification of the sample points
so that each row of $I_l$ ($l = 0, 1$) contains image intensities generated by a common
subset of light sources. Note that equations (16) and (17) are not strictly
equalities, because each sample point does not necessarily stay illuminated by a
unique subset of light sources throughout the input frames. However, assuming
that a majority of the sample points do, we first include those points illuminated
by different sets of light sources in different frames in either of the
illumination submatrices, and then exclude them as outliers by RANSAC prior
to the factorization in equation (16).

3.2 Segmentation of Illumination Subspace 

As is obvious from equation (17), $I_0$ and $I_1$ are at most rank 3, and thus the full rank
of $\bar{I}$, or of the unsorted $I$, is 6 in total. In order to consider the non-degenerate case,
we assume $n_j \ge 6$, i.e. we utilize a minimum of 6 frames. When we consider
an arbitrary row of $I$, $p_\alpha^T$, corresponding to an arbitrary sample point $\alpha$, and
regard it as an $n_j$-dimensional vector whose elements are the pixel
intensities of point $\alpha$ observed throughout the $n_j$ frames, the vector $p_\alpha$ represents a
point in the $n_j$-dimensional illumination space $\mathbb{R}^{n_j}$. Therefore, if $p_\alpha$ belongs to $I_0$,
we have

$$p_\alpha^T = b_\alpha^T S_0 \tag{19}$$

where $p_\alpha$ represents a point in the 3-D illumination subspace $\mathcal{L}_0$ spanned by the
three row vectors of $S_0$, with the surface vector $b_\alpha$ as coefficients. Analogously,
each row of $I_0$ represents a point in the 3-D subspace $\mathcal{L}_0$, and each row of $I_1$ a
point in another 3-D subspace $\mathcal{L}_1$. Therefore, the classification problem of $n_i$
sample points reduces to segmenting a group of $n_i$ points in $\mathbb{R}^{n_j}$ into two
different 3-D subspaces. In the following, we first show that the segmentation is
mathematically possible on the basis of a theorem, and then propose a technique
to carry out the segmentation in practice by introducing the surface interaction
matrix.

Mathematical Background 

In the illumination space under two point light sources we define an $n_i \times n_i$
metric matrix $G = (G_{\alpha\beta})$ as

$$G = I I^T, \qquad G_{\alpha\beta} = (p_\alpha, p_\beta) \tag{20}$$

where $\alpha$ and $\beta$ are indices of two arbitrary sample points, $p_\alpha$ and $p_\beta$ are
the corresponding rows of $I$, and $(a, b)$ denotes the inner product of vectors $a$ and $b$.
By definition this is a rank 6 positive semi-definite
symmetric matrix [8]. We denote the eigenvalues of $G$ as $\lambda_1 \ge \dots \ge \lambda_{n_i}$ (only
the first six values are non-zero) and an orthonormal system of corresponding
eigenvectors as $\{v_1, \dots, v_{n_i}\}$; by orthonormal we mean that $v_i \cdot v_j = 0$ ($i \ne j$)
and $v_i \cdot v_j = 1$ ($i = j$) for an arbitrary pair of $i$ and $j$. Let us define an
$n_i \times n_i$ function matrix $H = (H_{\alpha\beta})$ with the first six eigenvectors as

$$H = \sum_{i=1}^{6} v_i v_i^T. \tag{21}$$

Then, according to the general theorem for subspace segmentation (see Appendix),
we have the following theorem as a special case where the number of
subspaces is two ($m = 2$):

Theorem 1 If $p_\alpha \in I_0$ and $p_\beta \in I_1$, then $H_{\alpha\beta} = 0$.

Theorem 1 implies the possibility of the classification, i.e. for each pair of sample
points, $\alpha$ and $\beta$, we can judge whether they are illuminated by an identical subset of
light sources by computing $H_{\alpha\beta}$. If the value is non-zero, they belong to the
same subset of light sources, and if they do not belong to the same subset of
light sources, the value is zero.
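The statement of Theorem 1 can be checked numerically on toy data. The sketch below is not the authors' implementation: it builds the matrix H as the orthogonal projector onto the column space of the illumination matrix, using Gram-Schmidt in place of an eigen-decomposition (both yield the same projector when the rank is 6). The matrices `S0`, `S1`, `B0`, `B1` and all function names are assumptions chosen only for illustration.

```python
def orthonormal_columns(matrix, tol=1e-9):
    """Gram-Schmidt on the columns of `matrix`; returns an orthonormal basis of its column space."""
    basis = []
    for column in zip(*matrix):
        v = list(column)
        for q in basis:
            dot = sum(x * y for x, y in zip(v, q))
            v = [x - dot * y for x, y in zip(v, q)]
        norm = sum(x * x for x in v) ** 0.5
        if norm > tol:  # skip columns that add no new direction
            basis.append([x / norm for x in v])
    return basis


def interaction_matrix(illum):
    """H = sum_i v_i v_i^T, the projector onto the column space of `illum`."""
    basis = orthonormal_columns(illum)
    n = len(illum)
    return [[sum(q[a] * q[b] for q in basis) for b in range(n)] for a in range(n)]


def rows_times(b_rows, s):
    """Rows b^T S for every surface vector b in b_rows."""
    return [[sum(b[k] * s[k][j] for k in range(3)) for j in range(len(s[0]))] for b in b_rows]


# Toy data (assumed, not from the paper): 6 frames, two 3-D illumination subspaces.
S0 = [[1, 0, 0, 1, 0, 0], [0, 1, 0, 0, 1, 0], [0, 0, 1, 0, 0, 1]]
S1 = [[1, 0, 0, 2, 0, 0], [0, 1, 0, 0, 2, 0], [0, 0, 1, 0, 0, 2]]
B0 = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
B1 = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 2, 3]]

illum = rows_times(B0, S0) + rows_times(B1, S1)  # rows 0-3 from S0, rows 4-7 from S1
H = interaction_matrix(illum)

# Largest magnitude among entries that pair points from different subspaces.
cross_block = max(abs(H[a][b]) for a in range(4) for b in range(4, 8))
```

On this noise-free example the cross-block entries vanish up to rounding, while entries within a block (for instance `H[0][3]`) generally do not.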

Surface Interaction Matrix 

Based on Theorem 1 we propose to carry out the task of classification systematically
for the entire set of sample points. Applying singular value decomposition
directly to the unsorted illumination matrix, $I$, we obtain the familiar form of
approximation $I = U \Sigma V^T$. Matrix $\Sigma$ is a diagonal matrix consisting of the
six greatest singular values, whereas $U$ and $V$ are the left and right singular
matrices, respectively, such that $U^T U = V^T V = E_{6 \times 6}$ (the identity matrix). Since
$U$ contains eigenvectors of $I I^T$, we can compute the $n_i \times n_i$ function matrix $H$
using its first six columns,

$$H = U U^T, \tag{22}$$

and we call it the "surface interaction matrix", as it preserves the interactive
property of the object surface with the light source direction. This matrix $H$ is by
definition uniquely computable from the illumination matrix $I$. Also, permuting
rows of $I$ does not change the set of values $H_{\alpha\beta}$ that appear in $H$, though their
arrangement in $H$ does change; swapping rows $\alpha$ and $\beta$ of $I$ results in swapping the
corresponding rows $\alpha$ and $\beta$ of $U$. Therefore, it results in simultaneously swapping
rows $\alpha$ and $\beta$ and columns $\alpha$ and $\beta$ in $H$, but not their entry values.

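Since the set of values does not change, to reveal the relevance of Theorem 1,
we investigate the character of $\bar{H}$, the surface interaction matrix for the canonical
illumination matrix $\bar{I}$. Factorizing each illumination submatrix $I_l$ ($l = 0, 1$) of $\bar{I}$
in equation (17) by singular value decomposition in a similar way to the above, we
have

$$I_l = U_l \Sigma_l V_l^T = (U_l \Sigma_l^{1/2} A_l)(A_l^{-1} \Sigma_l^{1/2} V_l^T) \qquad (l = 0, 1) \tag{23}$$

where $A_l$ represents an arbitrary invertible $3 \times 3$ matrix. Denoting also

$$\bar{U} = \begin{pmatrix} U_0 & 0 \\ 0 & U_1 \end{pmatrix}, \quad \bar{\Sigma} = \begin{pmatrix} \Sigma_0 & 0 \\ 0 & \Sigma_1 \end{pmatrix}, \quad \bar{A} = \begin{pmatrix} A_0 & 0 \\ 0 & A_1 \end{pmatrix}, \quad \bar{V}^T = \begin{pmatrix} V_0^T \\ V_1^T \end{pmatrix},$$

we have another factorized form of $\bar{I}$ as

$$\bar{I} = (\bar{U} \bar{\Sigma}^{1/2} \bar{A})(\bar{A}^{-1} \bar{\Sigma}^{1/2} \bar{V}^T). \tag{24}$$

Comparing the first factor of equation (24) with that of the factorization $\bar{I} = \bar{B}\bar{S}$, we obtain

$$\bar{B} = \bar{U} \bar{\Sigma}^{1/2} \bar{A}. \tag{25}$$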

The definition of the surface interaction matrix has been inspired by the work of
Costeira and Kanade, who proposed the "shape interaction matrix" and applied it
successfully to the multi-body structure-from-motion problem.
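Substituting the relation in equation (25) into equation (22),

$$\bar{H} = \bar{U}\bar{U}^T = \bar{B}(\bar{A}^T \bar{\Sigma} \bar{A})^{-1}\bar{B}^T = \bar{B}(\bar{B}^T \bar{B})^{-1}\bar{B}^T. \tag{26}$$

Further, substituting the definition of $\bar{B}$ in equation (18) into equation (26),

$$\bar{H} = \begin{pmatrix} B_0 & 0 \\ 0 & B_1 \end{pmatrix} \begin{pmatrix} (B_0^T B_0)^{-1} & 0 \\ 0 & (B_1^T B_1)^{-1} \end{pmatrix} \begin{pmatrix} B_0^T & 0 \\ 0 & B_1^T \end{pmatrix} = \begin{pmatrix} B_0 (B_0^T B_0)^{-1} B_0^T & 0 \\ 0 & B_1 (B_1^T B_1)^{-1} B_1^T \end{pmatrix}. \tag{27}$$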

This means that the canonical matrix $\bar{H}$ for the sorted $\bar{I}$ has a well-defined
block-diagonal structure, as expected from Theorem 1. Thus, each block of
$\bar{H}$ provides important information: sample points illuminated by a common set
of light sources belong to an identical block in $\bar{H}$.



Segmentation in Practice 

The problem of segmenting the sample points under different subsets of light
sources has now been reduced to sorting the entries of matrix $H$ into $\bar{H}$ by
swapping pairs of rows and columns until it becomes block diagonal. Once the
sorting is achieved, the corresponding permutation of rows of $I$ will transform
it to its canonical form, $\bar{I}$, where sample points under a common subset of light
sources are grouped into contiguous rows. We can then derive the light source
matrices $S_0$ and $S_1$ from each submatrix of $\bar{I}$ independently by the same technique
used in the case of a single light source.

In practical segmentation, in the presence of points illuminated by different
sets of light sources in different frames, perfect diagonalization may not be possible.
As stated earlier, however, this is not crucial, since those points will
be excluded by RANSAC anyway in the computation of the lighting parameters. It is
important that the borders between the blocks are roughly found so that each
block contains a sufficient number of sample points to compute the lighting parameters.
With noisy measurements, a pair of sample points under different subsets
of light sources may exhibit a small non-zero entry in $H$. We can regard $H_{\alpha\beta}^2$ as
representing the energy of the surface interaction, and the block diagonalization
of $H$ can be achieved by minimizing the total energy of all possible off-diagonal
blocks over all sets of permutations of rows and columns of $H$. A simple iterative
minimization procedure suffices for our purpose, but more efficient approaches
such as the hill-climbing method can be employed, as proposed by Costeira and Kanade for
sorting the shape interaction matrix. When the search over the set of permutations
is explosive, an approach based on a genetic algorithm can be adapted
efficiently.
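As a rough illustration of the energy criterion, the sketch below scores a labelling by the sum of squared entries of H that fall in off-diagonal blocks. For the grouping itself it uses a simple connected-components pass over thresholded entries, which is a swapped-in stand-in and not the iterative permutation search described above. The function names, the threshold and the toy matrix are all assumptions.

```python
def off_diagonal_energy(h, labels):
    """Sum of squared entries of h that link points carrying different labels."""
    n = len(h)
    return sum(h[a][b] ** 2 for a in range(n) for b in range(n) if labels[a] != labels[b])


def group_by_interaction(h, threshold=1e-6):
    """Label points so that any pair with h[a][b]**2 > threshold shares a label."""
    n = len(h)
    labels = [-1] * n
    current = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue
        labels[seed] = current
        stack = [seed]
        while stack:
            a = stack.pop()
            for b in range(n):
                if labels[b] == -1 and h[a][b] ** 2 > threshold:
                    labels[b] = current
                    stack.append(b)
        current += 1
    return labels


# Toy interaction matrix whose two blocks are interleaved (points 0, 2 versus 1, 3).
h = [
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
]
labels = group_by_interaction(h)
# Row/column order that makes h block diagonal.
order = sorted(range(len(h)), key=lambda a: labels[a])
```

With noisy data the threshold would need tuning, and points that switch light-source subsets between frames can bridge two groups, which is why the text relies on RANSAC to discard them afterwards.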
3.3 Generalized Segmentation of Illumination Subspace

The discussion for two point light sources applies to the case with an arbitrary
number of light sources (see Appendix). In the general case where m
subsets of the light sources exist, basically a minimum of 3m frames will be 
necessary. Although the number of the blocks, namely the number of illumina- 
tion subspaces, m, that reside on the surface of the object will be required prior 
to the segmentation, it can also be identified by computing the rank r of the 
illumination matrix I since r = 3m. 
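Identifying m from the rank can be sketched as counting significant singular values. The relative tolerance `rel_tol` and the sample values below are made-up assumptions for illustration; in practice the singular values would come from an SVD of the illumination matrix and the cut-off would depend on the noise level.

```python
def count_subspaces(singular_values, rel_tol=1e-3):
    """Return (rank, m), where rank counts singular values above rel_tol * largest and m = rank // 3."""
    largest = max(singular_values)
    rank = sum(1 for s in singular_values if s > rel_tol * largest)
    return rank, rank // 3


# Hypothetical singular values: six significant ones followed by noise-level ones.
rank, m = count_subspaces([9.1, 5.2, 3.3, 2.0, 1.1, 0.7, 1e-5, 3e-6])
```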

Regarding the condition to have non-degenerate I, the sample points in each 
lighting classification should have surface normals that span 3D. Also, the rota- 
tion axis with respect to the object motion should not coincide with the light 
source direction, nor be coplanar throughout the input frames. 

4 Surface Reconstruction 

In this section we describe our algorithm for dense surface recovery under multiple
light sources. With the light source matrices $S_0$ and $S_1$ acquired using the
technique discussed in the previous section, we can exploit the equations derived
for the case of a single light source. Although $S_0$ and $S_1$ are valid as they are
insofar as the surface stays illuminated by an identical set of light sources
throughout the image sequence, this is
not the case in general. The fundamental difficulty is in dealing with the subset
of light sources that is not constant throughout the frames, differs for each
surface point, and is unavailable in advance. For each surface point, $i$, nevertheless,
a specific light source matrix must exist and be formed by an appropriate
combination of $S_0$ and $S_1$. We denote such a matrix $S^i$. Each column of $S^i$ is
either from $S_0$ or $S_1$ depending on which subset of light sources illuminates point
$i$ in each frame. Hence, the number of possible candidates for $S^i$ is $m^{n_j}$, where $m$
is the number of possible subsets of the light sources. In order to search for
the depth $Z$, the correct error in the Geotensity constraint
should be measured with the light source matrix $S^i$. Since multiple candidates
exist for $S^i$, we measure the error using all the possible candidates at each image
point, $x$. As in the case of a single light source, we expect the error to approach
zero when the depth is correct and the error to be large when the depth is incorrect.
We also expect the error to be large when an incorrect candidate of $S^i$
is used. Because we compute the error for all the candidates of $S^i$ at regular
small intervals of depth, we regard the smallest error as $E(x, Z)$ and choose as
the depth estimate the depth $Z$ that minimizes the error $E(x, Z)$. Including
the steps to compute the light source matrices, the algorithm for estimating the
depth can be summarized as:

1° Decompose the illumination matrix $I$ using SVD to yield $I = U \Sigma V^T$,
and compute the rank $r = 3m$.

2° Using the first $r$ columns of $U$, compute the surface interaction matrix
$H = U U^T$ and block-diagonalize it.

3° For each of the $m$ blocks in $H$, permute matrix $I$ into submatrices, $I_l$ ($l =
1, \dots, m$), and compute $S_l$ accordingly.

4° At point $x$, measure $I(j; x, Z)$ for a particular guess
of depth $Z$.

5° Estimate the surface vector with $I(j; x, Z)$, considering all the combinations
of $S_l$ for $S$, and then the predicted intensity $\hat{I}(j; x, Z)$ by equation (13).

6° Computing the error $E(x, Z)$, choose the depth $Z$
that minimizes $E(x, Z)$ as the depth estimate.
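The combinatorial part of step 5 can be sketched as follows: every candidate light source matrix takes each of its columns (one per frame) from one of the m subset matrices, which gives m to the power of the number of frames candidates. The helper name `candidate_light_matrices` and the toy 3 x 3 matrices are assumptions for illustration; the error evaluation for each candidate is left out.

```python
from itertools import product


def candidate_light_matrices(subset_matrices, n_frames):
    """Yield (choice, S) pairs; column j of S is column j of subset_matrices[choice[j]]."""
    m = len(subset_matrices)
    for choice in product(range(m), repeat=n_frames):
        columns = [[row[j] for row in subset_matrices[choice[j]]] for j in range(n_frames)]
        # Transpose the list of columns back into a 3 x n_frames matrix.
        yield choice, [list(r) for r in zip(*columns)]


# Toy light source matrices for 3 frames (values chosen only to make columns recognizable).
S0 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
S1 = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
candidates = dict(candidate_light_matrices([S0, S1], n_frames=3))
```

For two subsets and eight frames this already gives 256 candidates per pixel and depth, which illustrates why the summary later calls for a mechanism to reduce the search.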

5 Experiments 

In order to investigate the performance of the proposed algorithms, we first
utilize the synthetic sphere shown in Figure 2 as an example of an input object
that satisfies the required assumptions. The synthetic sphere is illuminated by
two point light sources at infinity. The upper-left part is illuminated by both
of the light sources whereas the rest of the surface is illuminated only by one
of them. Since the appearance of the sphere would not change as it rotates, we
acquire several identical input images. As we need some point correspondences
to solve for the parameters of the imaging geometry as well as the lighting, in the
simulation we give some preliminary sample points arbitrarily on the sphere as
shown in Figure 3 and map the coordinates from one frame to the other while
giving some rotation to the sphere.




Fig. 3. Synthetic sphere (128 x 128) illuminated by two point light sources at infinity. 
Corresponding coordinates are marked by the cross. 4 images are shown out of 8 used. 

Figure 4 shows the surface interaction matrix $H$ computed for the 46 sample
points on the sphere surface and its diagonalized form $\bar{H}$. It is observed in the
diagonalized form that the sample points can be divided roughly into two clusters.
It is also observed that the result involves non-zero off-diagonal elements
due to some sample points which do not exactly belong to either of the clusters
while illuminated by different sets of light sources in different frames. However,
those points are excluded as outliers by RANSAC. In this example, a total of
17 points are excluded whereas the entire set of points is divided into 21 and
25 points in each cluster. From each cluster we then compute the light source
direction illuminating the corresponding area of the sphere surface.

The surface structure of the sphere was computed by measuring the depth at
every pixel in the first image. For illustration a vertical scan line in the middle
of the sphere ($x = 64$) is first examined in detail. We compute the error $E(Z)$
at each assumed depth $Z$ for each pixel on the scan line to generate
an error map for the line. Figure 5 shows the resulting error map where $S_0$ and
$S_1$ are utilized interchangeably in order to clarify their relevance. The darker
the image intensity, the smaller is the error $E(Z)$. The gray scale is histogram-equalized
so that regions with smaller error can be studied more easily. It should
be noted that $E(Z)$ has been computed without using a template and only by
referring to a pixel intensity in each frame. By searching for the depth $Z$ along
the horizontal axis to find the point where the error $E(Z)$ approaches zero and
then tracing this $Z$ through each vertical coordinate, a curve that reflects a
continuous estimation of depth should be generated.

Fig. 4. Left: The surface interaction matrix $H$ before sorting. Right: Diagonalized
surface interaction matrix $\bar{H}$. The lighter the gray scale, the greater is the value.

Fig. 5. The error map by the Geotensity constraint. The vertical axis corresponds to
that in the input image and the horizontal axis to the assumed depth $Z$. The darker
the gray scale is, the smaller is the error $E(Z)$. The gray scale is histogram-equalized.
(a) Computed only using $S_0$. (b) Computed only using $S_1$. (c) Computed using both $S_0$
and $S_1$. (d) Computed taking all the combinations of columns in $S_0$ and $S_1$ into account.
(e) The surface depth map. The lighter, the closer. The estimate in the extreme-right
part should be ignored since, due to the rotation, that part of the sphere is invisible in
most of the input image sequence.

Figure 5(a) and (b) show maps of $E(Z)$ computed using only $S_0$ or $S_1$,
respectively. The correct shape of the surface is estimated only for the upper part of
the scan line in Figure 5(a), and in (b) only for the lower part. This is expected
since $S_0$ and $S_1$ are valid as they are only where the surface stays illuminated by
the corresponding set of light sources. Figure 5(c) shows another map obtained
using both $S_0$ and $S_1$ and choosing the smaller $E(Z)$. A reasonable result is
obtained for the upper and lower part of the scan line where either $S_0$ or $S_1$ is
valid. However, the result for the part in between those two parts is not as desired
since the correct set of light sources for this part should be some combination
of $S_0$ and $S_1$. Finally, Figure 5(d) shows the map obtained by considering all
the combinations of $S_0$ and $S_1$ as the candidates of the light source matrix $S$.
Although the error map appears to be noisier since different combinations
of $S_0$ and $S_1$ are taken into account, the error is minimized at the correct depth
for the entire range of vertical coordinates. Figure 5(e) shows the result obtained by
performing the procedure for the entire surface. In the extreme-right part of
the sphere the estimate is not properly obtained since that part of the sphere
does not stay visible throughout the input images due to its rotation. Also, some
noise appears where an incorrect combination of $S_0$ and $S_1$ happens to minimize
the error $E(Z)$ at an incorrect depth. As a whole, however, it is observed that the
sphere surface is effectively recovered by investigating the Geotensity constraint
under multiple light sources.

Fig. 6. Statue of Julius Caesar. (a) Illuminated by a point light source in the viewing
direction. (b) Illuminated by another point light source placed at the right-hand side.
(c) Illuminated by both of the point light sources. (d) Illuminated by both of them but
in a different pose.

Fig. 7. 3-D reconstruction of Julius Caesar. (a) The depth map by the Geotensity
constraint. The lighter, the closer. (b) The depth map by correlation. (c) The recovered
surface by the Geotensity constraint shown with mesh. (d) The recovered surface as in
(c) but the surface is texture-mapped.

We have applied the scheme also to a statue of Julius Caesar shown in Figure 6.
The images of Julius Caesar were taken under two point light sources
placed in different directions, one in the viewing direction and the other at the
right-hand side, both about two meters away. Figure 6(a) and (b) illustrate the
image radiance due to each point light source whereas Figure 6(c) shows that
produced by both of the light sources. Seven other images of the statue, each
obtained in a different pose, were used, and thus, a total of eight images were
used in the experiment. Figure 6(d) shows one of them. As can be observed, one
of the light sources only illuminates part of the statue and the part varies due
to the motion of Julius Caesar. For instance the central part of the forehead is
illuminated by both of the light sources in Figure 6(c) but by only one of them
in Figure 6(d). In this way the appearance of the statue changes dramatically,
making the correspondence problem very difficult for conventional methods such
as those using a constant intensity constraint.

The resulting depth map computed with the Geotensity constraint is shown
in Figure 7(a). For the purpose of comparison, a depth map computed in the
same framework but using cross-correlation is shown in Figure 7(b). In both
cases we used a 15x15 template for the search to suppress the error arising
from the image noise. Clearly, less error is involved in the estimation by the
Geotensity constraint, especially in the area where the surface has little texture
such as around the forehead and cheek. This indicates the advantage of the Geotensity
constraint, which basically works regardless of the surface albedo. Figure 7(c)
shows the recovered surface with mesh to assess the depth accuracy. The recovered
surface is smoothed using a Gaussian operator (with standard deviation $\sigma = 1.0$).
Figure 7(d) shows the surface with texture. The result demonstrates the validity
of the proposed scheme for a real object.

6 Summary and Discussion 

For the problem of 3D object surface reconstruction by a single static camera,
we have considered extending the Geotensity constraint to the case of multiple
light sources. We have first shown that it is mathematically possible to sort the 
sample points into different clusters according to the subset of relevant light 
sources. Introducing the surface interaction matrix as a technique to carry out 
the task of practical segmentation, we have proposed that the object surface be 
computed by solving for the combined light source direction from each cluster 
and taking the combinations of different sets of light sources into account. Alt- 
hough the demonstration has been limited to the case of two point light sources, 
the algorithm is in principle applicable to the general case of an arbitrary un- 
known number of light sources. When a large number of subsets of light sources 
is involved in the presence of noise, a practical mechanism would be required in 
order to estimate the clustering of sample points and also reduce the search for 
the combination of light sources. Future work will be directed at developing such 
a mechanism, e.g. by employing a statistical approach for the grouping of points 
as proposed in pj, and also at extending the algorithm to treat more general 
lighting conditions, including shadowing or inter-reflection.

A Appendix: Subspace Segmentation 

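Consider $N$ points $p_\alpha$ in $n$-dimensional space $\mathbb{R}^n$ ($\alpha = 1, \dots, N$) and decompose the
group of indices $\mathcal{I} = \{1, \dots, N\}$ into $m$ subgroups as:

$$\mathcal{I}_1 \cup \dots \cup \mathcal{I}_m = \mathcal{I}, \qquad \mathcal{I}_1 \cap \dots \cap \mathcal{I}_m = \emptyset.$$

Define a rank $r_k$ subspace $\mathcal{L}_k$ spanned by the $k$-th group $p_\alpha$ ($\alpha \in \mathcal{I}_k$). When the $m$
subspaces $\mathcal{L}_k$ ($k = 1, \dots, m$) are linearly independent, $\mathcal{L}_1 \oplus \dots \oplus \mathcal{L}_m$ represents an
$r$-dimensional subspace of $\mathbb{R}^n$, so that $r = \sum_{k=1}^{m} r_k$. Assuming $N > r$, let us define an
$N \times N$ metric matrix $G = (G_{\alpha\beta})$ as

$$G_{\alpha\beta} = (p_\alpha, p_\beta), \tag{28}$$

where $\beta$ ($\beta = 1, \dots, N$) is an index to another sample point $p_\beta$ in $\mathbb{R}^n$. By definition this
is a rank $r$ positive semidefinite symmetric matrix [8]. We denote the eigenvalues as
$\lambda_1 \ge \dots \ge \lambda_N$ (only the first $r$ values are non-zero) and an orthonormal system of
corresponding eigenvectors as $\{v_1, \dots, v_N\}$. Let us define an $N \times N$ function matrix
$Q = (Q_{\alpha\beta})$ as

$$Q = \sum_{i=1}^{r} v_i v_i^T.$$

Theorem 2 If $\alpha \in \mathcal{I}_k$ and $\beta \notin \mathcal{I}_k$, then $Q_{\alpha\beta} = 0$. (See Kanatani [8] for the proof.)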
Abstract. A new definition of affine invariant skeletons for shape representation
is introduced. A point belongs to the affine skeleton if and
only if it is equidistant from at least two points of the curve, with the
distance being a minimum and given by the areas between the curve and
its corresponding chords. The skeleton is robust, eliminating the need 
for curve denoising. Previous approaches have used either the Euclidean 
or affine distances, thereby resulting in a much less robust computation. 

We propose a simple method to compute the skeleton and give examples 
with real images, and show that the proposed definition works also for 
noisy data. We also demonstrate how to use this method to detect affine 
skew symmetry. 

1 Introduction 

Object recognition is an essential task in image processing and computer vision,
with the skeleton or medial axis being a shape descriptor often used for this task.
Thus, the computation of skeletons and symmetry sets of planar shapes is a 
subject that received a great deal of attention from the mathematical (see jSl 
El and references therein), computational geometry [2nj, biological vision [1 4| 
ESI, and computer vision communities (see for example [1 iSf‘21 ) and references 
therein). All this activity follows from the original work by Blum P). 

In the classical Euclidean case, the symmetry set of a planar curve (or the 
boundary of a planar shape) is defined as the set of points equidistant from at 
least two different points on the given curve, providing the distances are local 
extrema (a number of equivalent definitions exist). The skeleton is a subset of 
this set. 

Inspired by pmi and HD- we define in this paper an analogous symmetry 
set, the affine area symmetry set (AASS). Instead of using Euclidean distances
to the curve, we define a new distance based on the areas enclosed between the 
curve and its chords. We define the symmetry set as the closure of the locus of 
points equidistant from at least two different points on the given curve, provided
that the distances are local minima. 

As we will demonstrate, this definition based on areas makes the symmetry 
set remarkably noise-resistant, because the area between the curve and a chord 
“averages out” the noise. This property makes the method very useful to compute 
symmetry sets of real images without the need of denoising. In addition, being 
an area-based computation, the result is affine invariant. That implies that if 
we process the image of a planar object, the skeleton will be independent of the 
angle between the camera that captures the image and the scene, provided the 
camera is far enough from the (flat) object. For these reasons, we believe this
symmetry set, in addition to having theoretical interest, has the strong potential 
of becoming a very useful tool in invariant object recognition. 

2 Affine Area Symmetry Set and Affine Invariant
Skeletons

In this section, we formally introduce the key concepts of affine distance, affine
area symmetry set and affine skeleton. We begin with the following definition:
Definition 1: A special affine transformation in the plane ($\mathbb{R}^2$) is defined as

$$\tilde{X} = AX + B, \tag{1}$$

where $X \in \mathbb{R}^2$ is a vector, $A \in SL_2(\mathbb{R}^2)$ (the group of invertible real $2 \times 2$
matrices with determinant equal to 1) is the affine matrix, and $B \in \mathbb{R}^2$ is a
translation vector.

In this work we deal with symmetry sets which are affine invariant, in the 
sense that if a curve $C$ is affine transformed with Eq. (1), then its symmetry
set is also transformed according to Eq. (1). Affine invariant symmetry sets are not
new. Giblin and Sapiro gurn introduced them and proposed the definition of the 
affine distance symmetry set (ADSS) for a planar curve C(s). In their definition, 
they used affine geometry, and defined distances in terms of the affine invariant 
tangent of $C(s)$, which involves second order derivatives of the curve with respect to an
arbitrary parameter. They then defined the ADSS analogously to the Euclidean
case. There is a fundamental technical problem with this definition: in curves
extracted from real images, noise is always present, and the second derivatives 
needed to compute the ADSS oscillate in a very wild fashion unless a considerable 
smoothing is performed. Giblin and Sapiro proposed a second definition, the 
affine envelope symmetry set pmn, which still requires derivatives, and thus 
suffers from the same computational problem. As we will show with the new 
definition presented below, we do not compute derivatives at all. Consequently,
the robustness of the computation is considerably higher, and shape smoothing 
is not required. 

For simplicity, here we shall always deal with simple closed curves $C(s) :
[0, 1] \to \mathbb{R}^2$ with a countable number of discontinuities of the derivative $C'(s)$.
We start by defining the building block of our affine area symmetry set (AASS),
the affine distance (inspired by jlYjh 






S. Betelu et al. 



Definition 2: The affine distance between a generic point $X$ and a point of the
curve $C(s)$ is defined by the area between the curve and the chord that joins
$C(s)$ and $X$:

$$d(X, s) = \frac{1}{2} \int_{C(s)}^{C(s')} (C - X) \times dC \tag{2}$$

where $\times$ is the $z$ component of the cross product of two vectors. The points
$C(s)$ and $C(s')$ define the chord that contains $X$ and that has exactly two contact
points with the curve, as shown in Fig. 1-A. This distance is invariant under the
affine transformation Eq. (1), and it is independent of the parametrization of
the curve.
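For a polygonal approximation of the curve, the integral in Eq. (2) reduces to a finite sum, since the integrand is constant along each straight segment. The sketch below is only an illustration under that assumption: `affine_distance`, `affine_map`, the toy arc and the matrix `a` are hypothetical, and the caller is assumed to supply an arc whose closing chord passes through the point. The result is positive for a counter-clockwise arc.

```python
def affine_distance(arc, x):
    """Area between a polygonal arc and its closing chord, which must pass through x.

    Evaluates (1/2) * sum over segments of (P_k - x) x (P_{k+1} - P_k),
    which is exact for a piecewise-linear curve.
    """
    total = 0.0
    for (px, py), (qx, qy) in zip(arc, arc[1:]):
        total += (px - x[0]) * (qy - py) - (py - x[1]) * (qx - px)
    return 0.5 * total


def affine_map(p, a, b):
    """Apply p -> a p + b for a 2x2 matrix a and a translation b."""
    return (a[0][0] * p[0] + a[0][1] * p[1] + b[0],
            a[1][0] * p[0] + a[1][1] * p[1] + b[1])


arc = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]  # two sides of the unit square
x = (0.5, 0.5)                              # midpoint of the chord (0,0)-(1,1)
a = [[2.0, 1.0], [0.0, 0.5]]                # determinant 1
b = (3.0, -1.0)

d_before = affine_distance(arc, x)
d_after = affine_distance([affine_map(p, a, b) for p in arc], affine_map(x, a, b))
```

Both values equal the area of the triangle cut off by the chord, which illustrates the invariance under a special affine transformation claimed above.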

For a simple convex curve, the function $d(X, s)$ is always defined for interior
points, but for concave curves the function $d(X, s)$ may be undefined for some
values of $s$ in $[0, 1]$ (as sketched in Fig. 1-B). When the point is exterior to the
curve (as the point $Y$ in Fig. 1-A), the distance may be undefined as well.

We can now define our affine symmetry set: 

Definition 3: $X \in \mathbb{R}^2$ is a point in the affine area symmetry set of $C(s)$ (AASS)
if and only if there exist two different points $s_1, s_2$ which define two different
chords that contain $X$ and have equal area,

$$d(X, s_1) = d(X, s_2), \tag{3}$$

provided that $d(X, s_1)$ and $d(X, s_2)$ are defined and that they are local minima
with respect to $s$.

This definition is analogous to the Euclidean case and the ADSS in jnfmj . 

Commonly in shape analysis, a subset of the symmetry set is used, and it is 
denoted as skeleton or medial axis. In the Euclidean case, one possible way to 
simplify the symmetry set into a skeleton, inspired by original work of Giblin 
and colleagues, is to require that the distance to the curve is a global minimum. 
The affine definition is analogous: 

Definition 4: The affine skeleton or affine medial axis is the subset of the AASS
where $d(X, s_1)$ and $d(X, s_2)$ are global (not just local) minima.

There are curves for which the skeleton may be computed exactly. For ex- 
ample, it is easy to verify that the skeleton of a circle is its center. We can make 
an affine transformation to the circle and transform it into an ellipse, and by 
virtue of the affine invariance of our definitions, we conclude that the skeleton of 
an ellipse is its center too. Another important example is the triangle. In affine 
geometry, all triangles may be generated by affine-transforming an equilateral 
triangle. The skeleton of an equilateral triangle consists of the segments connecting the
midpoints of the sides with the center of the figure (as can be verified by simple area
computations). As a consequence, the skeleton of an arbitrary triangle consists of the
segments connecting the midpoints of the sides with the center of gravity (where
all medians cross).



^ We should note that our distance is the integral of the distance used in prm| . 
An important property concerns convex curves with straight sections. If the
curve contains two equal straight parallel non-collinear segments, then the AASS 
will contain an identical segment parallel and equidistant to the former two. For 
instance, the AASS of a rhombus will contain its medians. 

2.1 Dynamical Interpretation and Affine Erosion

We now define a notion of affine erosion of a curve, which is analogous (but
not identical) to that used by Moisan. Let $C(s) : [0, 1] \to \mathbb{R}^2$ be a simple closed
curve. Let (C(s), C(s')) be a chord of C(s), as shown in Fig. 1A. As before, this
chord intersects the curve exactly twice. The difference between this definition and
the one given by Moisan is that in Moisan's set-up the chord is
allowed to cross the curve outside the interval (s, s'). For our purposes, we opted
for a definition which gives a unique chord for a given parameter s, when this 
chord exists, and which also separates the curve into two disjoint parts. The 
connected closed set enclosed by the chord and the curve is the chord set, and 
its area is denoted by A. 

Definition 5: The minimum distance from a point X to the curve C is defined
by

$$ f(X) = \inf\{\, d(X, s),\ s \in D \,\} \tag{4} $$

where D is the domain of d(X, s) for a fixed value of X (see below for conditions
for this for convex curves).

For interior points X, the minimum distance f{X) is always defined because in 
a simple closed curve, we can always draw at least one chord that contains X. 
For exterior points it may be undefined, as for instance, in a circle. 

Moisan defines the affine erosion of the curve C as the set of the points of 
the interior of C which do not belong to any positive chord set with area less 
than A. Here we define the affine erosion in terms of our affine distance: 
Definition 6: The affine erosion E(C, A) of the shape enclosed by a curve C,
by the area A, is the set of the points X of the interior of C that satisfy

$$ E(C, A) := \{\, X \in \mathbb{R}^2 : f(X) \ge A \ge 0 \,\}. \tag{5} $$

Roughly speaking, it is the region bounded by the envelope of all the possible
chords of area A. We also define the eroded curve C(A) as the "boundary"
of the affine erosion

$$ C(A) := \{\, X \in \mathbb{R}^2 : f(X) = A \,\}. \tag{6} $$

Note that if we consider the area to be a time parameter, t = A, the distance
f(X) represents the time that the eroded curve C(A) takes to reach the point
X. Initially, when A = 0, we have the initial curve. At later times (A > 0), the
curve C(A) will be contained inside the original curve.
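As a small illustration of the erosion, consider a circle, for which everything can be written in closed form. The sketch below is not part of the original text: it assumes a disc of radius `radius` and uses the elementary fact that a chord subtending a central angle θ cuts off a segment of area (θ − sin θ)·radius²/2 and lies at distance radius·cos(θ/2) from the center. The function names are hypothetical.

```python
import math

def segment_area(theta: float, radius: float = 1.0) -> float:
    """Area cut from a disc by a chord subtending the central angle theta."""
    return 0.5 * radius**2 * (theta - math.sin(theta))

def eroded_radius(area: float, radius: float = 1.0, tol: float = 1e-12) -> float:
    """Radius of the eroded curve C(A) of a circle.

    The chords of area `area` envelope a concentric circle; its radius is the
    distance from the center to any such chord. The angle is found by
    bisection, since segment_area is increasing on [0, pi].
    """
    lo, hi = 0.0, math.pi  # chords up to a diameter (at most half the disc)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if segment_area(mid, radius) < area:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    return radius * math.cos(theta / 2.0)
```

With A = 0 the eroded curve is the circle itself, and as A approaches half the area of the disc the eroded curve shrinks to the center, which is the skeleton of the circle.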

There is a fundamental relationship between the affine erosion of a curve and
the skeleton, namely, a shock point X is a skeleton point.

Definition 7: A shock point X is a point of the eroded curve C(A) where two
different chords $(C(s_1), C(s_1'))$ and $(C(s_2), C(s_2'))$ of equal area A intersect.

Clearly, the distances from X to these two points are equal, $d(X, s_1) = d(X, s_2)$,
and on the other hand, the distance is a global minimum at $s_1$ because, as X
belongs to C(A) (see Definitions 5 and 6), $f(X) = \inf\{ d(X, s),\ s \in D \} = A$.
Thus, a shock point X is a skeleton point.



2.2 Basic Properties of the Affine Area Distance and Symmetry Set 

In this section we shall present some theoretical results concerning the AASS,
mostly without proof. Further results and details will appear elsewhere. We shall
show that the AASS has connections with our previously defined affine envelope
symmetry set (AESS) as well as with the affine distance symmetry set
(ADSS) mentioned before. As mentioned above, it is not clear to us how far the
area definition can be extended. That is, it is not yet clear whether we can define
a smooth family of functions

$$ d : C \times U \to \mathbb{R}, $$

associating to (s, X) the 'area d(s, X) of the sector of C determined by C(s)
and X', for all points X inside some reasonably large set U in the plane $\mathbb{R}^2$. We
have seen that when C is convex then the function is well-defined and smooth
for all X inside the curve C. We shall assume this below.

We have

$$ 2\,d(s, X) = \int_s^{t(s)} [C(s) - X,\ C'(s)]\,ds = \int_s^{t(s)} F(s, X)\,ds, $$

for any regular parametrization of C, where [ , ] means the determinant of the
two vectors inside the square brackets, ' means d/ds and t(s) is the parameter
value of the other point of intersection of the chord through X and C(s) with
the curve. Using standard formulae for differentiation of integrals,

$$ 2\,d_s := 2\,\frac{\partial d}{\partial s} = F(t(s), X)\,t'(s) - F(s, X). $$

We evaluate t'(s) by using the fact that C(s), X and C(t(s)) are collinear:
$C(t(s)) - X = \lambda\,(C(s) - X)$ for a scalar λ, i.e. $[C(s) - X,\ C(t(s)) - X] = 0$.
Differentiating the last equation with respect to s we obtain

$$ t'(s) = \frac{[C(t(s)) - X,\ C'(s)]}{[C(s) - X,\ C'(t(s))]}, $$

^ Some of the results below have also been obtained by Paul Holtom m 
® This set is basically defined as the closure of the center of conics having three-point 
contact with at least two points on the curve. 
and from this it follows quickly that, provided the chord is not tangent to C at
C(s) or C(t(s)), $d_s = 0$ if and only if the collinearity scalar satisfies λ = ±1. But λ = 1 means C(s) and C(t(s))
coincide, so for us the interesting solution is λ = −1, which means that X is the
mid-point of the segment (this result has also been obtained elsewhere): The area
function d has a stationary point at (s, X), i.e. ∂d/∂s = 0, if and only if X is the
midpoint of the segment from C(s) to C(t(s)).
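The "follows quickly" step can be spelled out. The following worked derivation is a sketch added for clarity; it writes t for t(s), λ for the scalar in the collinearity relation $C(t) - X = \lambda\,(C(s) - X)$, and uses only the expressions for $2\,\partial d/\partial s$ and $t'(s)$ given above.

```latex
\begin{aligned}
2\,\frac{\partial d}{\partial s}
 &= [C(t)-X,\ C'(t)]\,\frac{[C(t)-X,\ C'(s)]}{[C(s)-X,\ C'(t)]} \;-\; [C(s)-X,\ C'(s)] \\
 &= \lambda\,[C(s)-X,\ C'(t)]\,\frac{\lambda\,[C(s)-X,\ C'(s)]}{[C(s)-X,\ C'(t)]} \;-\; [C(s)-X,\ C'(s)] \\
 &= (\lambda^{2}-1)\,[C(s)-X,\ C'(s)] .
\end{aligned}
```

The determinant $[C(s)-X,\ C'(s)]$ vanishes only when the chord is tangent to C at C(s), and the denominator $[C(s)-X,\ C'(t)]$ only when it is tangent at C(t(s)); away from these cases the derivative is zero exactly when $\lambda^2 = 1$.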

One immediate consequence of this is: The envelope of the chords cutting off
a fixed area from C (the affine eroded set) is also the locus of the midpoints of 
these chords. 

Further calculations on the same lines show that: The first two derivatives of 
d with respect to s vanish at (s,X) if and only if X is the midpoint of the chord 
and also the tangents to C at the endpoints of the chord are parallel. 

In mathematical language this means that the ‘bifurcation set’ of the family 
d is the set of points X which are the midpoints of chords of C at the ends 
of which the tangents to C are parallel. This set is also the envelope of lines 
parallel to such parallel tangent pairs and halfway between them, and has been 
called the midpoint parallel tangent locus (MPTL) by Holtom m- Various facts 
are known about the MPTL, for example it has an odd number of cusps, and 
these cusps coincide in position with certain cusps of the AESS nn. The cusps 
in question occur precisely when the point X is the center of a conic having 
3-point contact with C at two points where the tangents to C are parallel: The 
first three derivatives of d with respect to s vanish at (s,X) if and only if X is 
the midpoint of the chord, the tangents to C at the endpoints are parallel, and 
there exists a conic with center X having 3-point contact with C at these points. 

The full bifurcation set of the family d consists of those points X for which (i)
d has a degenerate stationary point ($d_s = d_{ss} = 0$) for some s or else (ii) there are
two distinct $s_1, s_2$ and d has an ordinary stationary point ($d_s = 0$) at each one,
and the same value there: $d(s_1, X) = d(s_2, X)$. The latter is precisely the AASS
as defined in Section 2 above. Mathematically the AASS and the MPTL 'go together' in
the same way that the classical symmetry set and evolute go together, or the 
ADSS and the affine evolute go together: in each case the pair makes up a single 
mathematical entity called a full bifurcation set. A good deal is known about 
the structure of such sets, including the structure of full bifurcation sets arising 
from families of curves. For instance, the symmetry set has endpoints in
the cusps of the evolute, and in the same way the AASS has endpoints in cusps
of the MPTL.

When X lies on the AASS there are two chords through X, with X the mid- 
point of each chord, and the areas defined by d are equal. From the midpoint 
conditions alone it follows that the four endpoints of the chords form a paral- 
lelogram. The tangent to the AASS at X is in fact parallel to two sides of this 
parallelogram. 



^ The ADSS as defined in jH)! also has endpoints and these are in the cusps of the 
affine evolute. The endpoints of the AASS are by contrast in the cusps of the midpoint 
parallel tangent locus. 
Consider the AASS of, for example, a triangle, as in Section 2, where it was
noted that the affine medial axis comes from the center of a side and stops at the 
centroid of the triangle. The ‘full’ AASS, allowing for non-absolute minima of d, 
stops at the point halfway up the median, as can be verified by an elementary 
calculation with areas. Presumably this is a highly degenerate version of a cusp 
on the AASS. With the triangle, the two branches of the cusp are overlaid on 
each other. 

We mention finally one curious phenomenon connected with the AASS. Given
a point $C(s_1)$ there will generally be an area-bisecting chord through this point.
That is, the two areas on either side of the chord and within C are equal. In
that case let X be the midpoint of the chord, and let $C(t(s_1))$ be the other end
of the chord. Let $s_2 = t(s_1)$: then $t(s_2) = s_1$ by construction and X satisfies
the conditions to be a point of the AASS. That is, the midpoints of all area- 
bisecting chords automatically appear in the AASS. These points are in some 
sense anomalous: the ‘genuine’ AASS consists of the other points satisfying the 
defining condition. 

3 Robust Numerical Implementation for Discrete Curves 

Inspired by earlier work, we propose the following algorithm to compute the affine skeleton:

1) Discretization of the curve: Discretize $C(s) = (x(s), y(s))$ with two vectors
for the points $C_k = (x_k, y_k)$ with $1 \le k \le M$ (see Fig. 1C).

2) Discretization of the rectangular domain: Discretize the domain that
contains the curve, of dimensions $L_x \times L_y$, with a uniform grid of $N_x \times N_y$
points. Each point $X_{ij}$ of the grid will have coordinates $X_{ij} = (i\,\Delta x,\ j\,\Delta y)$, where
$\Delta x = L_x / N_x$, $\Delta y = L_y / N_y$ (see Fig. 1C) and $0 \le i \le N_x$, $0 \le j \le N_y$.

Now, for each point Xij of the grid we perform the following steps: 

3a) Compute the chord areas: With Eq. (2), compute the areas $d(X_{ij}, C_k)$ between $X_{ij}$
and each point $C_k$ of the curve, for k = 1, ..., M. The integral is computed by approximating the curve with
a polygon that interpolates the points $C_k$, and computing the point $C_k'$ as the
intersection between the curve and the line joining $C_k$ and $X_{ij}$. As mentioned
before, the distance is not defined if the chord does not touch exactly two points
on the curve. The points have to be labeled with a logical vector $E_k$ indicating
whether or not the distance is defined at $C_k$. Here we detect these singular points
just by scanning around the curve and counting the crossings between the curve and the line
that contains $C_k$ and $X_{ij}$. Then we store the areas in a vector $d_k = d(X_{ij}, C_k)$,
with k = 1, ..., M.
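Step 3a can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the curve is a simple closed polygon given as an (M, 2) array, the area is taken on the side swept from $C_k$ forward along the vertex order, and `chord_area` and `cross2` are hypothetical names. The function returns `None` where the distance is undefined, which plays the role of the logical vector $E_k$.

```python
import numpy as np

def cross2(a, b):
    """Determinant [a, b] of two plane vectors."""
    return a[0] * b[1] - a[1] * b[0]

def chord_area(curve, k, X, eps=1e-12):
    """Area d(X, C_k) swept from C_k forward to the second chord endpoint.

    curve : (M, 2) array, vertices of a simple closed polygon in order.
    Returns None when the line through C_k and X does not meet the rest of
    the polygon in exactly one point, or when X is not on the chord.
    """
    curve = np.asarray(curve, dtype=float)
    X = np.asarray(X, dtype=float)
    M = len(curve)
    Ck = curve[k]
    direction = X - Ck
    hits = []  # (edge offset, intersection point, line parameter lam)
    for step in range(1, M - 1):            # skip the two edges touching C_k
        a = curve[(k + step) % M]
        b = curve[(k + step + 1) % M]
        edge = b - a
        denom = cross2(direction, edge)
        if abs(denom) < eps:                # edge parallel to the line
            continue
        lam = cross2(a - Ck, edge) / denom       # position along the line
        mu = cross2(a - Ck, direction) / denom   # position along the edge
        if 0.0 <= mu < 1.0:
            hits.append((step, Ck + lam * direction, lam))
    if len(hits) != 1 or hits[0][2] < 1.0:  # lam >= 1 puts X on the chord
        return None
    step, end, _ = hits[0]
    # Region boundary: C_k, C_{k+1}, ..., start of the hit edge, chord endpoint.
    idx = [(k + m) % M for m in range(step + 1)]
    pts = np.vstack([curve[idx], end])
    x, y = pts[:, 0], pts[:, 1]
    # Shoelace formula for the polygon area.
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
```

For the unit square with $C_k = (0, 0)$ and $X = (0.5, 0.25)$, the chord ends at (1, 0.5) and cuts off a triangle of area 0.25; X is then the midpoint of the chord, consistent with the stationarity result of Section 2.2.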

3b) Search for local minima of the chord areas, approximated by the
local minima of the set $d_k$, with k = 1, ..., M. When $d_{k-1}$, $d_k$ and $d_{k+1}$ are
defined, the local minimum condition is $d_{k-1} > d_k < d_{k+1}$. When $d_k$ and $d_{k \mp 1}$
are defined but $d_{k \pm 1}$ is not defined, the condition is simply $d_k < d_{k \mp 1}$. Now, for
each point $X_{ij}$ of the grid we shall have after this step the set of $l$ local minimum
distances $(d_1^*, \dots, d_l^*)$ corresponding to the points $(C_1^*, \dots, C_l^*)$. All these
quantities are functions of $X_{ij}$.
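The local-minimum test of step 3b, including the one-sided rule at the boundary of the region where the distance is defined, might look as follows. This is a sketch: undefined distances are represented by `None`, indices wrap around because the curve is closed (an assumption of this illustration), and `local_minima` is a hypothetical name.

```python
def local_minima(d):
    """Indices k at which d[k] is a local minimum along a closed curve.

    d : sequence of chord areas, with None where the distance is undefined.
    Two-sided test when both neighbours are defined, one-sided test when
    exactly one neighbour is defined.
    """
    M = len(d)
    minima = []
    for k in range(M):
        if d[k] is None:
            continue
        prev, nxt = d[(k - 1) % M], d[(k + 1) % M]
        if prev is not None and nxt is not None:
            if prev > d[k] < nxt:
                minima.append(k)
        elif prev is not None:
            if d[k] < prev:
                minima.append(k)
        elif nxt is not None:
            if d[k] < nxt:
                minima.append(k)
    return minima
```

An isolated defined value with both neighbours undefined is skipped here; whether such a point should count as a minimum is not specified in the text.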

3c) Approximate AASS computation: Compute the differences

$$ D(X_{ij}) = d_p^* - d_q^*, \qquad p \ne q, \tag{8} $$

where p and q represent the indexes of the local minimum distances of different chords.
If this difference is smaller in magnitude than a given tolerance ε, we can consider
$X_{ij}$ to be an approximate point of the area symmetry set (see Fig. 1C). Then,
as a first approximation, add to the AASS all the points $X_{ij}$ that satisfy

$$ |d_p^* - d_q^*| < \varepsilon. \tag{9} $$

The tolerance ε has to be of the order of the variation of the distance difference
along a cell in the discretized domain, i.e.

$$ \varepsilon \approx \max\left\{ \left|\frac{\partial D}{\partial x}\right| \Delta x,\ \left|\frac{\partial D}{\partial y}\right| \Delta y \right\}. \tag{10} $$

The partial derivatives are taken with respect to the components of $X_{ij}$.

There is no simple general formula for this expression. However, by using
the distance definition and by restricting ourselves to regular local minima
which satisfy $d_s(s^*) = 0$, we can demonstrate that $\nabla d^*(X_{ij}) = (-\Delta y^*, \Delta x^*)$,
where $(\Delta x^*, \Delta y^*)$ are the components of the chord $C^{*\prime} - C^*$ corresponding
to the local minimum area. This formula is not general, since we may
have non-regular minima, as for example, points where d(X, s) has a discontinuity
with respect to s. However, we still use this expression because we only need
the order of magnitude of the tolerance. Then, we define the tolerance to be
$\varepsilon = \max\left( |\Delta y_p^* - \Delta y_q^*|\,\Delta x,\ |\Delta x_p^* - \Delta x_q^*|\,\Delta y \right)$.

3d) Focusing of the AASS: At this point, the skeleton is quite crude, and
the branches have a spatial error of the order of the discretization of the domain
(Δx, Δy). We can compute with negligible computational cost the (small)
correction vector $\Delta X_{ij} = (u, v)$ to the position $X_{ij}$ that makes the difference of
distances exactly equal to zero at $X_{ij} + \Delta X_{ij}$:

$$ d(X_{ij} + \Delta X_{ij}, C_p^*) = d(X_{ij} + \Delta X_{ij}, C_q^*) \tag{11} $$

(see Fig. 1C). At first order, we must solve

$$ D(X_{ij}) + \Delta X_{ij} \cdot \nabla D(X_{ij}) = 0 \tag{12} $$

with $\Delta X_{ij}$ parallel to the gradient of D. We get

$$ u = \frac{(\Delta y_p^* - \Delta y_q^*)\,D}{G^2}, \qquad v = \frac{-(\Delta x_p^* - \Delta x_q^*)\,D}{G^2}, \qquad G^2 = (\Delta y_p^* - \Delta y_q^*)^2 + (\Delta x_p^* - \Delta x_q^*)^2. \tag{13} $$

After this step, the AASS of the discrete image has been computed.
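The focusing step is a single Newton-like correction along the gradient of D. A minimal sketch, assuming the gradient of each local-minimum distance is (−Δy*, Δx*) as stated in step 3c, with chords passed as (Δx, Δy) pairs; `focus_correction` is a hypothetical name.

```python
def focus_correction(D, chord_p, chord_q):
    """First-order correction (u, v) that brings D = d_p - d_q to zero.

    chord_p, chord_q : (dx, dy) components of the two minimizing chords.
    Returns None when the gradient of D vanishes (equal chord vectors).
    """
    gx = -(chord_p[1] - chord_q[1])   # dD/dx = -(dy_p - dy_q)
    gy = chord_p[0] - chord_q[0]      # dD/dy =   dx_p - dx_q
    G2 = gx * gx + gy * gy
    if G2 == 0.0:
        return None
    # Displacement parallel to grad D solving D + (u, v) . grad D = 0.
    return (-D * gx / G2, -D * gy / G2)
```

For example, with D = 0.5 and chords (1, 0) and (0, 1), the gradient is (1, 1) and the correction is (−0.25, −0.25), which satisfies the linearized equation exactly.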

3e) Pruning: If one of the distances $d_p^*$ or $d_q^*$ is not a positive global minimum,
then discard the corresponding point. In this way we obtain the affine
skeleton. We also have to discard points which originate so close to each
other on the curve that they are effectively indistinguishable.
Here we discard points such that the area of the triangle $A(C_p, X_{ij}, C_q)$ is
smaller than the area of the triangles defined by the discretization of the curve,
$A(C_p, X_{ij}, C_{p+1}) + A(C_p', X_{ij}, C_{p+1}')$, and the areas $\Delta x\,L_y$ and $\Delta y\,L_x$ defined by the discretization
of the domain, as indicated in Fig. 1D.



4 Examples 



In the following examples, we compute the skeleton in a domain discretized with 
Nx = Ny = 200 points. 

Skeletons may be used to detect symmetries, a topic that has been the subject
of extensive research in the computer vision community. In
particular, affine skeletons may be used to detect skew symmetries. Numerical
experiments show that if the skeleton contains a straight branch then a portion
of the curve has skew symmetry with respect to this line. This is illustrated now.
In Fig. 2A we show the original figure with its corresponding skeleton, while in
Fig. 2B we show the figure affine transformed by a matrix A. In
Fig. 2C we corrupted the shape by adding to each point of the discrete curve a
random number of amplitude 0.025. The corresponding skeleton remains almost
unchanged.

We now show how to compute skeletons from real data. First we need to
obtain the points of the curve C which define the shape. If the points are to be
extracted from a digital image with good contrast, they may be extracted by
thresholding the image to binary values and then getting the boundary with a
boundary-following algorithm. This procedure was performed with the shape
on the right in Fig. 2D and with the tennis racket of Fig. 2E. When the border
of the shape is more complex or fuzzier, as in the shape on the left in Fig. 2D,
more sophisticated techniques can be used. Here we extracted the contour with
the "snakes" algorithm. The resulting skeletons are shown
in Fig. 2D. In Fig. 2E, the image has an approximate skew symmetry, and note
that the skeleton contains a short straight branch.



5 Conclusions and Open Questions 

In this paper, we introduced a new definition of affine skeleton and a robust 
method to compute it. The definition is based on areas, making the skeleton 

® Note that although the Euclidean skeleton of a symmetric shape contains a straight 
line, this is not true anymore after the shape is affine transformed, obtaining skew 
symmetry. 
remarkably insensitive to noise, and thus, useful for processing real images. There
are still theoretical problems that have to be solved in order to build a consistent 
theory to support our definition: 

(a) The skeleton in the Euclidean case may be found by detecting the shocks in 
the solutions of the Hamilton- Jacobi equation (this is equivalent to the Huygens 
principle in Blum’s method). We do not know at this point whether there is a 
differential equation which would allow us to compute our affine skeleton in an 
analogous way. 

(b) Additional properties of the area-distances d{X, s) and AASS, analogous to 
those for the ADSS and AESS, are to be further investigated. 

(c) The extension of the definition to multiply connected curves would be of 
paramount importance in practical applications. 
Fig. 1. A) We define the affine distance from X to the curve as the area between
the curve C and the chord (C(s), C(s')). The chord touches the curve exactly twice.
For example, there is no "legal" chord which contains the pair of points (C(s''), X).
As a consequence, the function d(X, s) may have discontinuities as shown in B. C)
Discretization of the curve and the domain. D) We consider two points p, q "indistinguishable"
if the area of the triangle $A(C_p, X, C_q)$ (dashed) is smaller than the area
defined by the discretization of the curve, $A(C_p, X, C_{p+1}) + A(C_p', X, C_{p+1}')$, and the
discretization of the domain (shaded regions).




Fig. 2. (In lexicographic order) A) A concave curve and its skeleton. B) After an
affine transformation, the skeleton keeps the information about the original symmetry.
C) When noise corrupts the curve, the main branches of the skeleton may still be
recognized (in this computation, M = 115 and N = 200). D) Affine skeletons of shapes
extracted from digital images. E) A figure with approximate skew symmetry.
Abstract. We formulate the problem of reconstructing the shape and
radiance of a scene as the minimization of the information divergence
between blurred images, and propose an algorithm that is provably convergent
and guarantees that the solution is admissible, in the sense of
corresponding to a positive radiance and imaging kernel. The motivation
for the use of information divergence comes from the work of Csiszar,
while the fundamental elements of the proof of convergence come from
work by Snyder et al., extended to handle unknown imaging kernels
(i.e. the shape of the scene).

1 Introduction 

An imaging system, such as the eye or a video-camera, involves a map from 
the three-dimensional environment onto a two-dimensional surface. In order to 
retrieve the spatial information lost in the imaging process, one can rely on prior 
assumptions on the scene and use pictorial information such as shading, texture, 
cast shadows, edge blur etc. All pictorial cues are intrinsically ambiguous in that 
prior assumptions cannot be validated using the data. 

As an alternative to relying on prior assumptions, one can try to retrieve 
spatial information by looking at different images of the same scene taken, for 
instance, from different viewpoints (parallax), such as in stereo and motion (note 
that we must still rely on prior assumptions in order to solve the correspondence 
problem). In addition to changing the position of the imaging device, one could 
change its geometry. For instance, one can take different photographs of the 
same scene with a different lens aperture or focal length. Similarly, in the eye 
one can change the shape of the lens by acting on the lens muscles. There is a 
sizeable literature on algorithms to reconstruct shape from a number of images 

* This research was supported by NSF grant IIS-9876145 and ARO grant DAAD19- 
99-1-0139. The authors wish to thank J. C. Schotland and J. A. O’Sullivan for useful 
discussions and suggestions, and S. Nayar and M. Watanabe for kindly providing us 
with the test images used in the experimental section. 
taken with different imaging geometry (shape from defocus) or from a controlled
search over geometric parameters (shape from focus). Recently the conditions
under which it is possible to obtain a unique reconstruction have been derived.

Estimating shape from focus/defocus boils down to inverting certain integral 
equations, a problem known by different names in different communities: in signal 
processing it is “blind deconvolution” or “deblurring”, in communications and 
information theory “source separation”, in image processing “restoration”, in 
tomography “inverse scattering” . Since images depend both on the shape of the 
scene and on its reflectance properties - neither of which is known - estimating 
shape is tightly related to estimating reflectance. In this paper, we consider
the two problems as one and the same, and discuss the simultaneous solution
of both. We choose as criterion the minimization of the information divergence
(I-divergence) between blurred images, motivated by the work of Csiszar. The
algorithm we propose is iterative, and we give a proof of its convergence to a 
(local) minimum. We present results on both real and simulated images. 

1.1 Statement of the Problem 

Consider a piecewise smooth surface represented symbolically by σ. For instance,
σ could be the parameters in a parametric class of surfaces, or it could be a
smooth function such that σ(x, y, z) = 0 (note that it may not necessarily be
finite-dimensional). Consider then an imaging system whose geometry can - to a
certain extent - be modified by acting on some parameters $u \in \mathcal{U} \subset \mathbb{R}^k$ for some
k. For instance, u could be the aperture radius of the lens and the focal length.
The image at a point (x, y) in a compact subset of the plane $D \subset \mathbb{R}^2$ is obtained
by integrating the energy radiated by the surface, which we represent as a (not
necessarily continuous) positive-valued distribution R defined on σ, over a region
that depends upon u. Due to the additive nature of energy transport phenomena,
the image is obtained by integrating the energy distribution R against a kernel
h that depends upon σ and u. We therefore write

$$ I_u(x, y) = \int h_u^\sigma(x, y)\, dR, \qquad (x, y) \in D. \tag{1} $$

Notice that, in the equation above, all quantities are constrained to be positive:
dR because it represents the radiant energy (which cannot be negative), $h_u^\sigma$
because it specifies the region of space over which energy is integrated, and $I_u$
because it measures the photon count hitting the surface of the sensor.

We are interested in estimating the shape of the surface σ and the energy
distribution R to the extent possible, by measuring a number l of images obtained
with different camera settings $u_1, \dots, u_l$.

^ Since neither the light source nor the viewer move, we do not distinguish between
the radiance and reflectance of a surface.

^ We use the term "positive" for a quantity x to indicate x ≥ 0. When x > 0 we say
that x is "strictly positive".
In the literature of computational vision a number of algorithms have been
proposed to estimate depth from focus/defocus.
Most of the papers formulate the problem as the minimization
of a least-squares norm or total variation.

The capability to reconstruct the scene's shape depends upon the energy
distribution it radiates. The conditions on the radiance distribution that allow
a unique reconstruction of shape have been recently characterized.



1.2 Formalization of the Problem 

If we collect a number of different images $I_{u_1}, \dots, I_{u_l}$ and organize them into a
vector I (and so for the kernels h), we can write

$$ I(x, y) = \int h^\sigma(x, y)\, dR, \qquad (x, y) \in D. \tag{2} $$

The right-hand side of the above equation can be interpreted as the "virtual
image" of a given surface σ radiating energy with a given distribution R. We
call such virtual image b:

$$ b^\sigma(x, y, R) = \int h^\sigma(x, y, X, Y, Z)\, dR(X, Y, Z), \qquad (x, y) \in D. \tag{3} $$

Note that, for images of opaque objects, the integral is restricted to their surface,
and therefore can be written in the Riemannian sense [2] as

$$ b^\sigma(x, y, R) = \int h^\sigma(x, y, \tilde{x}, \tilde{y})\, dR(\tilde{x}, \tilde{y}), \qquad (x, y) \in D \tag{4} $$

for a suitably chosen parameterization $(\tilde{x}, \tilde{y}) \in \mathbb{R}^2$. In either case, we write the
integral in short-hand notation as $b^\sigma(x, y, R) = \int h^\sigma(x, y)\, dR$. Since the image
I is measured on the pixel grid, the domain D (i.e. a patch in the image) is
$D = [x_1, \dots, x_N] \times [y_1, \dots, y_M]$, so that we have

$$ I(x_i, y_j) = b^\sigma(x_i, y_j, R), \qquad i = 1 \dots N,\ j = 1 \dots M. \tag{5} $$

We now want a "criterion" φ to measure the discrepancy between the measured
image I and the virtual one, so that we can formulate our problem as the minimization
of the discrepancy between the measured image and the model (or
virtual) image. Common choices of criteria include the least-squares distance
between I and $b^\sigma$, or the integral of the absolute value ("total variation") of
their difference.

In order to get a "reasonable" result, the criterion φ should satisfy a number
of requirements. Csiszar makes this notion rigorous through the axiomatic
derivation of cost functions that satisfy certain consistency conditions. He concludes
that, when the quantities involved are constrained to be positive (such

® When there are no positivity constraints, Csiszar argues that the only consistent 
choice of discrepancy criterion is the norm, which we have addressed in CS] 
as in our case), the only consistent choice of criterion is the so-called information
divergence, or I-divergence, which generalizes the well-known Kullback-Leibler
pseudo-metric and is defined as

$$ \phi(I \,\|\, b^\sigma(R)) = \sum_{i,j} \left[ I(x_i, y_j) \log \frac{I(x_i, y_j)}{b^\sigma(x_i, y_j, R)} - I(x_i, y_j) + b^\sigma(x_i, y_j, R) \right]. \tag{6} $$

In order to emphasize the dependency of the cost function φ on the unknowns
σ, R, we abuse the notation to write

$$ \phi = \phi(\sigma, R). \tag{7} $$

Therefore, we formulate the problem of simultaneously estimating the shape of a
surface and its radiance as that of finding σ and R that minimize the I-divergence:

$$ \hat{\sigma}, \hat{R} = \arg\min_{\sigma, R} \phi(\sigma, R). \tag{8} $$
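On a pixel grid the I-divergence is a short sum. The sketch below is an illustration, not the authors' code: it assumes the measured image and the model image are given as NumPy arrays of equal shape with non-negative entries, uses the usual convention 0·log 0 = 0, and the name `i_divergence` and the guard `eps` are hypothetical.

```python
import numpy as np

def i_divergence(I, b, eps=1e-12):
    """I-divergence between a measured image I and a model image b.

    Sums I*log(I/b) - I + b over all pixels. Pixels with I == 0 contribute
    only b, following the convention 0 * log 0 = 0.
    """
    I = np.asarray(I, dtype=float)
    b = np.asarray(b, dtype=float)
    ratio = np.maximum(I, eps) / np.maximum(b, eps)  # guarded against 0/0
    log_term = np.where(I > 0, I * np.log(ratio), 0.0)
    return float(np.sum(log_term - I + b))
```

The value is zero when the two images coincide and positive otherwise; unlike a least-squares distance it is not symmetric in its two arguments.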



1.3 Alternating Minimization 

In general, the problem in (8) is nonlinear and infinite-dimensional. Therefore,
we restrict our attention from the outset to (local) iterative schemes that
approximate the optimal solution. To this end, suppose an initial estimate of R
is given: $R_0$. Then iteratively solving the two following optimization problems

$$ \begin{cases} \sigma_{k+1} = \arg\min_{\sigma} \phi(\sigma, R_k) \\ R_{k+1} = \arg\min_{R} \phi(\sigma_{k+1}, R) \end{cases} \tag{9} $$

leads to the minimization of φ, since

$$ \phi_{k+1} = \phi(\sigma_{k+1}, R_{k+1}) \le \phi(\sigma_{k+1}, R_k) \le \phi(\sigma_k, R_k) = \phi_k \tag{10} $$

and the sequence $\phi_k$ is bounded below by zero. However, solving the two optimization
problems in (9) may be overkill. In order to have a sequence $\{\phi_k\}$
that monotonically converges it is sufficient that - at each step - we choose σ
and R in such a way as to guarantee that equation (10) holds, that is

$$ \begin{cases} \sigma_{k+1} : \ \phi(\sigma_{k+1}, R) \le \phi(\sigma_k, R), & R = R_k \\ R_{k+1} : \ \phi(\sigma, R_{k+1}) \le \phi(\sigma, R_k), & \sigma = \sigma_{k+1}. \end{cases} \tag{11} $$
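The alternating scheme only requires that each partial step does not increase the cost. A generic sketch of that loop follows; the function names and the toy cost used in the usage note are illustrative assumptions and have nothing to do with the imaging model itself.

```python
def alternating_minimization(phi, step_sigma, step_r, sigma0, r0, iters=50):
    """Alternate two descent steps and record the cost after each sweep.

    phi        : cost function phi(sigma, r)
    step_sigma : returns a new sigma with phi(new, r) <= phi(sigma, r)
    step_r     : returns a new r with phi(sigma, new) <= phi(sigma, r)
    """
    sigma, r = sigma0, r0
    history = [phi(sigma, r)]
    for _ in range(iters):
        sigma = step_sigma(sigma, r)   # first half-step: update the shape
        r = step_r(sigma, r)           # second half-step: update the radiance
        history.append(phi(sigma, r))
    return sigma, r, history
```

For instance, with the toy cost phi(s, r) = (s − r)² + (r − 1)², the exact partial minimizers are s = r and r = (s + 1)/2; starting from (5, 0) the recorded costs decrease monotonically and the iterates approach (1, 1).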



2 Minimizing I-Divergence 

In this section we derive an algorithm to minimize the I-divergence and prove its 
convergence. For simplicity, we restrict our analysis to an “equifocal imaging model”,
that is a model where the kernel h is translation-invariant. This corresponds
to the scene being approximated, locally, by small patches of a plane parallel to 
the lens. This can be done to an arbitrary degree anywhere on a smooth surface 
away from discontinuities, which will therefore be resolvable only up to the size 
of the patch. 
2.1 An Elementary Imaging Model

In order to obtain a simple instance of the problem, we assume that, locally for
$(x_i, y_j)$ in a patch $U \subset D$ away from discontinuities of σ, the kernel $h^\sigma(x_i, y_j, x, y)$
is shift-invariant, so that we can write it as

$$ h^\sigma(x_i - x,\ y_j - y). \tag{12} $$

Equivalently, we represent the surface as a collection of planar patches parallel to
the lens. We also introduce the density corresponding to the energy distribution
R and denote it with the function r defined by

$$ r(x, y)\,dx\,dy = dR(x, y). \tag{13} $$

Strictly speaking, r is the Radon-Nikodym derivative of R, and as such it may
not be an ordinary function but, rather, a distribution of measures. We neglect
such technicalities here, since they do not affect the derivation of our algorithm.
The imaging process can thus be modeled (locally) as a convolution integral:

$$ I(x, y) = h^\sigma * r\,(x, y), \qquad (x, y) \in U \subset D. \tag{14} $$

We will further assume that energy is conserved, and therefore

$$ \int h^\sigma(x, y)\,dx\,dy = 1 \quad \forall\, \sigma. \tag{15} $$

In order to further simplify the problem, we restrict our attention to geometric
optics, and represent the kernel $h^\sigma$ by a Gaussian with standard deviation
$\sigma = \frac{d}{2}\,|1 - Z/Z_F|$, where Z is the depth of the scene, $Z_F$ is the depth of the point in
focus and d is the diameter of the lens (see Figure 1). We choose a Gaussian kernel
not because it is a good model of the imaging process, but because it makes the
analysis and the implementation of the algorithm straightforward. The algorithm
does not depend upon this choice, and indeed we are in the process of building
realistic models for the kernels of commercial cameras.

Fig. 1. A bare-bone model of the geometry of image formation.

2.2 Steps of the Alternating Minimization 

In the imaging model just described, the "shape" of the surface is trivial and
represented by a positive scalar σ that depends upon Z, the depth of the patch
U. Since the first step of the minimization depends only on this parameter,
we can choose any of the known descent methods (e.g. Newton-Raphson). The
choice is arbitrary and does not affect the considerations that follow. Therefore,
we indicate this step generically as:

$$ \sigma_{k+1} = \arg\min_{\sigma > 0} \phi(\sigma, r_k). \tag{16} $$



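The second step is obtained from the Kuhn-Tucker conditions associated with
the problem of minimizing φ for fixed σ under positivity constraints for r:

$$ \sum_{i,j} \frac{h^\sigma(x_i, y_j, x, y)\, I(x_i, y_j)}{\int h^\sigma(x_i, y_j, \tilde{x}, \tilde{y})\, r(\tilde{x}, \tilde{y})\, d\tilde{x}\, d\tilde{y}}
\ \begin{cases} = \sum_{i,j} h^\sigma(x_i, y_j, x, y) & \forall\, (x, y) : r(x, y) > 0 \\ \le \sum_{i,j} h^\sigma(x_i, y_j, x, y) & \forall\, (x, y) : r(x, y) = 0. \end{cases} \tag{17} $$

Since such conditions cannot be solved in closed form, we look for an iterative
procedure for r that will converge to a fixed point. Following Snyder et al.,
we choose

$$ F(\sigma, r) = \frac{1}{\sum_{i,j} h^\sigma(x_i, y_j, x, y)} \sum_{i,j} \frac{h^\sigma(x_i, y_j, x, y)\, I(x_i, y_j)}{b^\sigma(x_i, y_j, r)} \tag{18} $$

and define the following iteration:

$$ r_{k+1} = r_k\, F(\sigma, r_k). \tag{19} $$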
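For a fixed σ, the iteration is a multiplicative update of the radiance. The sketch below discretizes it in one dimension; this simplification is an assumption of the illustration (the paper works on 2-D image patches), as are the names `gaussian_kernel_matrix`, `radiance_update` and `i_div` and the choice to sample the radiance on the same grid as the image.

```python
import numpy as np

def gaussian_kernel_matrix(n, sigma):
    """Matrix H[i, j] = h(x_i - x_j) for a sampled Gaussian of width sigma.

    Each column is normalized to sum to one (discrete energy conservation).
    """
    x = np.arange(n, dtype=float)
    H = np.exp(-0.5 * ((x[:, None] - x[None, :]) / sigma) ** 2)
    return H / H.sum(axis=0, keepdims=True)

def radiance_update(r, H, I):
    """One multiplicative step r_{k+1} = r_k * F(sigma, r_k)."""
    b = H @ r                                # model image b(x_i, r)
    F = (H.T @ (I / b)) / H.sum(axis=0)      # factor F on the radiance grid
    return r * F

def i_div(I, b):
    """I-divergence for strictly positive arrays."""
    return float(np.sum(I * np.log(I / b) - I + b))
```

Starting from a strictly positive `r`, every iterate stays positive because `F` is a ratio of positive quantities, and the I-divergence between `I` and `H @ r` does not increase from one step to the next, as Claim 1 asserts.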



It is important to point out that this iteration decreases the I-divergence φ not
only when we use the exact kernel $h^\sigma$, as shown in Snyder et al., but
also with any other kernel satisfying the positivity and smoothness constraints.
This fact is proven by the following claim.

Claim 1 Let $r_0$ be a non-negative real-valued function defined on $\mathbb{R}^2$, and let
the sequence $r_k$ be defined according to (19). Then $\phi(\sigma, r_{k+1}) \le \phi(\sigma, r_k)$ for all k > 0
and all σ > 0. Furthermore, equality holds if and only if $r_{k+1} = r_k$.

Proof: The proof follows Snyder et al. From the definition of φ in equation
(6) we get

$$ \phi(\sigma, r_{k+1}) - \phi(\sigma, r_k) = -\sum_{i,j} I(x_i, y_j) \log \frac{b^\sigma(x_i, y_j, r_{k+1})}{b^\sigma(x_i, y_j, r_k)} + \sum_{i,j} \left[ b^\sigma(x_i, y_j, r_{k+1}) - b^\sigma(x_i, y_j, r_k) \right]. \tag{20} $$
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From the expression of rk+i in J TfA) we have that the second sum in the above 
expression is given by 



^ / h'^{xi,yj,x 



,y)rk+i{x,y)dxdy 




h'^{xi, yj,x, y)rk{x, y)dxdy = 



J HQ{x,y)rk+i{x,y)dxdy 



Ho{x,y)rk{x,y)dxdy 



where we have defined Hq (x, y) = j Vj^x, y), while the ratio in the first 

sum is 



b<^{xi,yj,rk) 



= / F{a,rk) 



h’^{xi,yj,x,y)rk{x,y) 

b'^{xt,yj,rk) 



dxdy. 



(21) 



We next note that, from Jensen's inequality,

$$\log\left( \int F(\sigma, r_k)\, \frac{h^\sigma(x_i,y_j,x,y)\, r_k(x,y)}{b^\sigma(x_i,y_j,r_k)}\,dx\,dy \right) \ge \int \frac{h^\sigma(x_i,y_j,x,y)\, r_k(x,y)}{b^\sigma(x_i,y_j,r_k)}\, \log(F(\sigma, r_k))\,dx\,dy \tag{22}$$

since the ratio $h^\sigma(x_i,y_j,x,y)\, r_k(x,y) / b^\sigma(x_i,y_j,r_k)$ can be interpreted as a probability distribution
dependent on the parameters $\sigma$ and $r_k$, and therefore the expression in (20) is

$$\phi(\sigma, r_{k+1}) - \phi(\sigma, r_k) \le -\sum_{i,j} I(x_i,y_j) \int \frac{h^\sigma(x_i,y_j,x,y)\, r_k(x,y)}{b^\sigma(x_i,y_j,r_k)}\, \log(F(\sigma, r_k))\,dx\,dy + \int H_0(x,y)\, r_{k+1}(x,y)\,dx\,dy - \int H_0(x,y)\, r_k(x,y)\,dx\,dy.$$

The right-hand side of the above expression can be written as

$$-\Phi\big(H_0(x,y)\, r_{k+1}(x,y),\; H_0(x,y)\, r_k(x,y)\big) \tag{23}$$

where we define $\Phi(f(x,y), g(x,y))$ as $\int f(x,y) \log\frac{f(x,y)}{g(x,y)} - f(x,y) + g(x,y)\,dx\,dy$,
which can be easily verified to be a non-negative function for any positive $f$, $g$.
Therefore, we have

$$\phi(\sigma, r_{k+1}) - \phi(\sigma, r_k) \le 0. \tag{24}$$

Note that Jensen's inequality holds with equality if and only if $F(\sigma, r_k)$ is a
constant; since the only admissible constant value is 1, then we have $r_{k+1} = r_k$,
which concludes the proof.
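
The monotone behaviour stated in Claim 1 can be checked numerically on a small discretized problem. The sketch below is only an illustration under stated assumptions: the one-dimensional Gaussian kernel matrix, the grid size and the helper names (`i_divergence`, `radiance_update`, `demo`) are hypothetical and not taken from the paper; only the multiplicative update $r_{k+1} = r_k F(\sigma, r_k)$ and the I-divergence follow the text.

```python
import numpy as np

def i_divergence(image, blurred):
    """I-divergence between an observed image and a model image (both positive)."""
    return float(np.sum(image * np.log(image / blurred) - image + blurred))

def radiance_update(kernel, image, radiance):
    """One discretized step r_{k+1} = r_k * F(sigma, r_k).

    kernel[m, n] plays the role of h^sigma(x_i, y_j, x, y): row m indexes an
    image pixel (x_i, y_j), column n indexes a scene point (x, y).
    """
    blurred = kernel @ radiance                      # b^sigma(x_i, y_j, r_k)
    column_sums = kernel.sum(axis=0)                 # sum_{i,j} h^sigma
    factor = (kernel.T @ (image / blurred)) / column_sums   # F(sigma, r_k)
    return radiance * factor

def demo(steps=20, seed=0):
    """Run the update on synthetic 1-D data and return the cost at each step."""
    rng = np.random.default_rng(seed)
    n = 32
    grid = np.arange(n)
    # Hypothetical positive kernel: a Gaussian of width 1.5 samples.
    kernel = np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / 1.5) ** 2)
    true_radiance = rng.uniform(0.5, 2.0, size=n)
    image = kernel @ true_radiance
    radiance = image.copy()          # initialize with the (positive) image
    costs = [i_divergence(image, kernel @ radiance)]
    for _ in range(steps):
        radiance = radiance_update(kernel, image, radiance)
        costs.append(i_divergence(image, kernel @ radiance))
    return costs
```

Calling `demo()` returns the sequence of I-divergence values for a fixed kernel; under the assumptions above, each value should be no larger than the previous one.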

Finally, we can conclude that the proposed algorithm generates a monotonically
decreasing sequence of values of the cost function $\phi$. We say that the initial
conditions $\sigma_0, r_0$ are admissible if $\sigma_0 > 0$ and $r_0$ is a positive function defined on
$\mathbb{R}^2$.

Corollary 1 Let $\sigma_0, r_0$ be admissible initial conditions for the sequences $\sigma_k$
and $r_k$ defined from equations (16) and (19) respectively. Let $\phi_k$ be defined as
$\phi(\sigma_k, r_k)$; then the sequence $\{\phi_k\}$ converges to a limit $\phi^*$:

$$\lim_{k \to \infty} \phi_k = \phi^*. \tag{25}$$

Proof: Follows directly from equation (16) and Claim 1, together with the
fact that the I-divergence is bounded from below by zero.

Even if $\phi_k$ converges to a limit, it is not necessarily the case that $\sigma_k$ and
$r_k$ do. Whether this happens or not depends on the observability of the model m,
which has been analyzed in mu-

3 Experiments 

In this section we discuss some details that are important for implementing the 
algorithm just described on a digital computer, and test its performance on a 
set of experiments on real and simulated images. 

3.1 Implementation 

Since the algorithm we propose is iterative, we need to initialize it with a
feasible radiance. We choose the initial estimate of the
radiance to be equal to the first image. This choice is guaranteed to be
feasible since the image is positive. Since $h^\sigma(x_i,y_j,x,y)$ is discrete in the first two
variables, one needs to exercise caution when performing numerical integration
against kernels smaller than the unit step of the discretization $(x_i, y_j)$. This case
cannot be discounted because it occurs whenever the patch on the image that
we are observing is close to being in focus. In our implementation, integrals are
computed with a first-order (linear) approximation as a tradeoff between speed
and accuracy.

Another important detail to bear in mind is that it is necessary to choose the
appropriate integration domain. The fact that we use an equifocal imaging model
allows us to use the same reference frame for the image and for space, which is
represented locally by a plane parallel to it. However, the image in any given
patch receives contributions from a region of space possibly bigger than the patch
itself. Thus we write $I(x_i,y_j) = \int h^\sigma(x_i,y_j,x,y)\,(r_I(x,y) + r_O(x,y))\,dx\,dy$, where
$r_O$ is the radiance outside the patch that contributes to the convolution with
the kernel $h^\sigma$. In the real and synthetic experiments we always use two images,
with planes focused at 529 mm and 869 mm as in the data set provided to us by
Watanabe and Nayar US!; the lens diameter is such that the maximum kernel
radius is around 2.3 pixels. With these values the kernel is well approximated
by a Gaussian since the radius is small compared to the image patch dimension.
Therefore, we define $r_O$ on a domain that is 3 pixels wider than the domain of
$r_I$, which we choose to be 7 x 7, ending up integrating on patches of dimension
13 x 13.

3.2 Experiments with Synthetic Images

In this set of experiments we investigate the robustness of the algorithm to 
noise. Even though we have not derived the algorithm based upon a particular 
noise model (all the discussion is strictly deterministic), it can be shown that 
minimizing the I-divergence can be cast into a stochastic framework by modeling 
the image noise as a Poisson process (the arrivals of photons on the sensor 
surface). 

We have generated 10 noisy image pairs and considered patches of size 7 x 7
pixels. Smaller patches result in greater sensitivity to noise, while larger ones
challenge the equifocal approximation. We have considered additive Gaussian
noise with a variance that ranges from 1% to 10% of the radiance
magnitude, which guarantees that the positivity constraint is still satisfied with
high probability. The computed depths are summarized in figure 2.
We iterate the algorithm 5 times at each point.




Fig. 2. Depth error as a function of image noise, mean and std. 



As can be seen, the algorithm is quite robust to additive noise, even though
it will not converge if the radiance is not sufficiently exciting (in the sense defined in m).
All this will be seen in the experiments with real images described
below.

3.3 Experiments with Real Images

We have tested the algorithm on the images in figures 4 and 7, provided to
us by M. Watanabe and S. Nayar. These images were generated by a telecentric
optic (see ini for more details), in which there is no change in scale for different
focus settings. A side effect is that now the real lens diameter is not constant,
and therefore we need to correct our optical model according to figure 3. More
precisely, we substitute the diameter $d$ with a new diameter $D$ that depends on
the quantities $a$ and $f$ indicated in figure 3. For this experiment, in order to speed up the
computation, we chose to run the algorithm for 5 iterations and to compute
depth at every other pixel along both coordinate axes. At points where the
radiance is not rich enough, or where the local approximation with an equifocal
plane is not valid, the algorithm fails to converge. This explains why in figure 5
some points are visibly incorrect, and in figure 8 the depth of the white
background is poorly retrieved. A convergence test could be employed, although it
would slow down the computation considerably.




Fig. 3. The modified diameter in the telecentric lens model. 



4 Conclusions 

We have proposed a solution to the problem of reconstructing shape and radiance 
of a scene using I-divergence as a criterion in an infinite-dimensional optimization 
framework. The algorithm is iterative, and we give a proof of its convergence to 
a (local) minimum which, by construction, is admissible in the sense of resulting 
in a positive radiance and imaging kernel. 
Fig. 4. Original images: near focused (left); far focused (right). The difference between
the two images is barely perceivable since the two focal planes are only 340 mm apart.

Fig. 5. Reconstructed depth for the scene in figure 4, coded in grayscale.

Fig. 6. Reconstructed depth for the scene in figure 4, smoothed mesh.




Fig. 7. Original images: near focused (left); far focused (right). As in figure 4, the
difference between the two is barely perceivable.

Fig. 8. Reconstructed depth for the scene in figure 7, coded in grayscale. In the uniform
region of the background, the radiance is not sufficiently exciting, in the sense defined
in m- Therefore, the algorithm cannot converge and the quality of the estimates, as
can be seen, is poor.

Fig. 9. Reconstructed depth for the scene in figure 7, smoothed mesh.

Abstract. A general formulation for geodesic distance propagation of
surfaces is presented. Starting from a surface lying on a 3-manifold in
$\mathbb{R}^4$, we set up a partial differential equation governing the propagation
of surfaces at equal geodesic distance (on the 3-manifold) from the given
original surface. This propagation scheme generalizes a result of Kimmel
et al. EH and provides a way to compute distance maps on manifolds.
Moreover, the propagation equation is generalized to any number of
dimensions. Using an Eulerian formulation with level-sets, it gives stable
numerical algorithms for computing distance maps. This theory is used
to present a new method for surface matching which generalizes a curve
matching method |S|. Matching paths are obtained as the orbits of the
vector field defined as the sum of two distance maps' gradient values.
This surface matching technique applies to the case of large deformation
and topological changes.



1 Introduction and Previous Work 

The theory of front propagation has received particular attention in the past
few years mm- It sheds light on deriving new methods for computing
geodesic paths on surfaces and manifolds mini; this framework is particularly well
suited for answering image processing questions (for instance active contours,
deformable templates iinii2Ei, matching structures 0). The latter problem of
matching structures is very important in computer vision. A general matching
formulation can be stated as: given two structures $\mathcal{S}$ and $\mathcal{D}$, define a function
$\chi$ which associates to any point of $\mathcal{S}$ a corresponding point on $\mathcal{D}$. Introducing
properties of the structures, such as their geometry or the underlying image
representation, makes it possible to characterize a unique matching function $\chi$. Relevant geometrical
properties are selected on the basis of their ability to characterize a description of
the structures which is invariant to the considered deformation. In the case of
rigid or small elastic deformations, high curvature points laiK] or semi-differential
invariants m can be considered as an invariant description of the structure.
Matching features in 3D images is also an important task in 3D medical image
analysis Mm .

In this paper we set up a generalization of a curve evolution scheme introduced
by Kimmel et al. in El We derive a general partial differential equation
governing the front propagation for a family of surfaces lying on a 3-manifold (or
“hypersurface”) in $\mathbb{R}^4$. Considering an Eulerian formulation for the projection of
the surface onto a hyperplane unveils a stable method for computing distance
maps on a 3-manifold. As a direct application of the theoretical framework
presented in this work, we generalize to the case of surfaces a matching technique
introduced by Cohen et al. for curve matching. A brief outline of the method
follows. Given two surfaces, we characterize the similarity
between these two structures by defining a hypersurface $W \subset \mathbb{R}^4$ based on
the computation of two geodesic distance maps derived from the two surfaces.
These geodesic distance maps are then combined and allow us to define the paths
matching the structures as the paths lying on $W$ which maximize a similarity
criterion.

This new approach to the surface matching problem leads to algorithms able to
handle large deformations and changes of topology, using the geodesic distance
map as the key feature for defining a similarity criterion that can be based on
distance alone or can integrate curvature information. The proposed method is
described throughout the paper and we proceed step by step towards this
general surface matching approach. First, we set up the geodesic surface evolution
method on a 3D hypersurface $W$ embedded in $\mathbb{R}^4$. This theory makes use of
the Hodge “star” * operator, a notion briefly reviewed in an appropriate
subsection. Another subsection is devoted to the level-set formulation of that geodesic
surface evolution scheme, allowing for the computation of distance maps on the
3-manifold $W$. The level-set evolution equation takes the form of
Hamilton-Jacobi partial differential equations, for which J. Sethian P! has introduced
stable and robust numerical resolution schemes described in another section.
Then, the surface matching method is introduced, with the computation of
paths between the source and destination surfaces $\mathcal{S}$ and $\mathcal{D}$ which minimize a cost
function whose graph is $W$. Results are displayed in a specific section. Lastly
we contemplate a generalization of the method to higher dimensional manifolds.
Then we conclude and sketch some perspectives.

2 A Geodesic Distance Evolution Rule for Propagating
Surfaces on a 3-Manifold

Let $W$ be a 3-manifold (or hypersurface) in $\mathbb{R}^4$. We suppose $W$ compact and
pathwise connected$^1$. From these assumptions, one can derive, using an easy
application of the Hopf-Rinow-De Rham theorem (see CZI), that given any two
points $M_0$ and $M_1 \in W$, there is a unique path $\gamma : [0, 1] \to W$ connecting
$M_0$ and $M_1$ ($\gamma(0) = M_0$, $\gamma(1) = M_1$) whose length minimizes the lengths of all
paths between the two points. The length of $\gamma$ is called the geodesic distance
between $M_0$ and $M_1$, and will be denoted $d_W(M_0, M_1)$ in the sequel. Moreover,
the path $\gamma$ is necessarily a geodesic curve on $W$, i.e. a curve such that the
second derivative $\frac{d^2\gamma}{dt^2}$ is always perpendicular to $W$. Let $\mathcal{Y} \subset W$ be a surface
(2-manifold) “drawn” on $W$. We consider the surfaces $S_t \subset W$ whose points are
located at geodesic distance $t$ from $\mathcal{Y}$:

$$S_t = \{ M \in W \mid d_W(M, \mathcal{Y}) = t \}. \tag{1}$$

We are interested in defining a partial differential equation governing the
evolution of the surface $S_t$ as the parameter $t$ evolves. For this purpose, we will need
a notion of cross-product in 4-space, and a method of deriving simple formulae
about such a cross-product. Generalizing the cross-product can be done in
different ways. Here, we choose the Hodge * operator, because it provides direct
formulae needed in the demonstration of intermediate propositions.



2.1 Exterior Algebras and the Hodge * Operator 

The theory is only briefly recalled here. The reader is referred to the literature for a complete
presentation of the subject. Let $E$ be a finite dimensional vector space over $\mathbb{R}$,
and let $(e_1, e_2, \dots, e_n)$ be a basis of $E$, with $n = \dim E$. For any integer $p$,
$0 \le p \le n$, one can construct new vector spaces, usually denoted $\Lambda^p(E)$, such
that:

- by convention, $\Lambda^0(E) = \mathbb{R}$ and $\Lambda^1(E) = E$.

- $\Lambda^p(E)$ is the set of formal sums $\sum a_{i_1 i_2 \cdots i_p}\, u_{i_1} \wedge u_{i_2} \wedge \cdots \wedge u_{i_p}$ for
multi-indexes $(i_1, i_2, \dots, i_p)$ and real coefficients $a_{i_1 i_2 \cdots i_p}$, the $u_{i_j}$ being ordinary
vectors of $E$.

The “wedge” products $u_{i_1} \wedge u_{i_2} \wedge \cdots \wedge u_{i_p}$ are supposed to be multilinear in the
variables $u_{i_1}, u_{i_2}, \dots, u_{i_p}$ and alternating, that is to say $u_{i_1} \wedge u_{i_2} \wedge \cdots \wedge u_{i_p} = 0$
if one of the vectors in this wedge product is equal to another vector. From
these conditions, one can derive that for every $p$, $0 \le p \le n$, $\Lambda^p(E)$ is finite

$^1$ These assumptions do not restrict the validity of the theory presented in this work,
as the manifold $W$ will appear as the graph of a cost function, which automatically
satisfies these requirements in practice.



772 



E.G. Huot et al. 



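dimensional, with dimension $\binom{n}{p} = \frac{n!}{p!\,(n-p)!}$. A basis of $\Lambda^p(E)$ is given by
the family of vectors $e_{i_1} \wedge e_{i_2} \wedge \cdots \wedge e_{i_p}$ with $1 \le i_1 < i_2 < \cdots < i_p \le n$. For
instance, if $E$ is 4-dimensional space with standard basis $(e_1, e_2, e_3, e_4)$, the
standard basis of $\Lambda^3(\mathbb{R}^4)$ is $(e_1 \wedge e_2 \wedge e_3,\ e_1 \wedge e_2 \wedge e_4,\ e_2 \wedge e_3 \wedge e_4,\ e_1 \wedge e_3 \wedge e_4)$,
and the standard basis of $\Lambda^4(\mathbb{R}^4)$ is $(e_1 \wedge e_2 \wedge e_3 \wedge e_4)$.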

Now suppose a dot product $\langle \cdot, \cdot \rangle$ is defined in $E$. A general dot product,
usually denoted $\langle \cdot, \cdot \rangle_p$, can be defined in $\Lambda^p(E)$ by:

$$\langle u_1 \wedge u_2 \wedge \cdots \wedge u_p,\ w_1 \wedge w_2 \wedge \cdots \wedge w_p \rangle_p = \det(\langle u_i, w_j \rangle) \tag{2}$$

In equation (2) the quantities $\langle u_i, w_j \rangle$ inside the determinant are ordinary
dot products in $E$. Since $\binom{n}{p} = \binom{n}{n-p}$, the vector spaces $\Lambda^p(E)$ and
$\Lambda^{n-p}(E)$ are isomorphic. The Hodge * operator provides a standard
isomorphism between these two vector spaces. It is defined in the following manner.
Let $\lambda$ and $\mu$ be two elements of $\Lambda^p(E)$ and $\Lambda^{n-p}(E)$ respectively. The image of
$\lambda$ by the Hodge operator, denoted $*\lambda$, belongs to $\Lambda^{n-p}(E)$ and is characterized
by the equality:

$$\lambda \wedge \mu = \langle *\lambda, \mu \rangle_{n-p}\ e_1 \wedge e_2 \wedge \cdots \wedge e_n \tag{3}$$

Let us now see how all of this operates in practice. Suppose that $E$ is the standard
3-space $\mathbb{R}^3$, and that $u$ and $v$ are two vectors in $\mathbb{R}^3$. Using elementary calculus
and equation (3) one can easily show that $*(u \wedge v)$ is the standard cross-product in
$\mathbb{R}^3$. If $w$ is a third vector in $\mathbb{R}^3$, $*(u \wedge v \wedge w)$ is simply the determinant of
the three vectors $u$, $v$, $w$, also known as their triple product. Now take $E = \mathbb{R}^4$ (it
is the case that interests us in this work), and choose three linearly independent
vectors $u$, $v$ and $w$ in $\mathbb{R}^4$. It is easy to prove that the associated image $*(u \wedge v \wedge w)$
satisfies the following properties:



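- It is a vector in $\mathbb{R}^4$ perpendicular to $u$, $v$ and $w$.

- The basis $(u, v, w, *(u \wedge v \wedge w))$ is positively oriented.

- It has components

$$\left( -\begin{vmatrix} u_2 & u_3 & u_4 \\ v_2 & v_3 & v_4 \\ w_2 & w_3 & w_4 \end{vmatrix},\ \begin{vmatrix} u_1 & u_3 & u_4 \\ v_1 & v_3 & v_4 \\ w_1 & w_3 & w_4 \end{vmatrix},\ -\begin{vmatrix} u_1 & u_2 & u_4 \\ v_1 & v_2 & v_4 \\ w_1 & w_2 & w_4 \end{vmatrix},\ \begin{vmatrix} u_1 & u_2 & u_3 \\ v_1 & v_2 & v_3 \\ w_1 & w_2 & w_3 \end{vmatrix} \right)$$

in the standard basis of $\Lambda^1(\mathbb{R}^4) = \mathbb{R}^4$ ($u = u_1 e_1 + u_2 e_2 + u_3 e_3 + u_4 e_4$ and similarly
for $v$, $w$).

- Its squared norm $\| *(u \wedge v \wedge w) \|^2$ is equal to

$$\begin{vmatrix} \langle u,u \rangle & \langle u,v \rangle & \langle u,w \rangle \\ \langle v,u \rangle & \langle v,v \rangle & \langle v,w \rangle \\ \langle w,u \rangle & \langle w,v \rangle & \langle w,w \rangle \end{vmatrix}.$$

Note that the last equality about the norm $\| *(u \wedge v \wedge w) \|^2$ generalizes the
usual formula about the norm of the ordinary cross-product in 3-space.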
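
These properties are easy to check numerically. The following is a minimal sketch, assuming the components are obtained as signed 3x3 minors of the matrix whose rows are $u$, $v$, $w$ (signs $-, +, -, +$); the function name `hodge_star_3` is hypothetical.

```python
import numpy as np

def hodge_star_3(u, v, w):
    """Return *(u ^ v ^ w) for three vectors of R^4 as signed 3x3 minors."""
    rows = np.array([u, v, w], dtype=float)           # shape (3, 4)
    components = []
    for j in range(4):
        # Minor obtained by deleting column j from the rows u, v, w.
        minor = np.linalg.det(np.delete(rows, j, axis=1))
        components.append((-1) ** (j + 1) * minor)    # signs -, +, -, +
    return np.array(components)
```

For the standard basis, `hodge_star_3(e1, e2, e3)` gives `e4`, so that the basis `(e1, e2, e3, e4)` is positively oriented, in agreement with the second property.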

With this notion of cross-product in 4-space given by the Hodge * operator, we
can now derive the geodesic distance evolution scheme for the family of surfaces
$S_t$. This is presented in the following subsection.

2.2 The Geodesic Distance Evolution Equation

We consider a local orthogonal parametrization $\alpha(u,v,t)$ of the surface $S_t$, i.e.
a parametrization $\alpha(u,v,t)$ such that $\langle \tau^u, \tau^v \rangle = 0$, with $\tau^u = \frac{\alpha_u}{\|\alpha_u\|}$ and
$\tau^v = \frac{\alpha_v}{\|\alpha_v\|}$. It is a known result that an orthogonal parametrization can always
be found (see |0| p. 183):

Lemma 1 Around any point on a surface one can always find a local orthogonal
parametrization.

So let $\alpha(u,v,t)$ be a local orthogonal parametrization of a family of surfaces in
$W$, and let $\tau^u = \frac{\alpha_u}{\|\alpha_u\|}$, $\tau^v = \frac{\alpha_v}{\|\alpha_v\|}$ denote the two unitary tangent vectors
determined by $\alpha$, and $N$ the normal vector to $W$. We suppose that the family
$\alpha(u,v,t)$ satisfies the following partial differential equation:

$$\frac{\partial \alpha}{\partial t} = *(N \wedge \tau^u \wedge \tau^v) \tag{4}$$



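We want to prove that, for each $t$, $\alpha(u,v,t)$ is a local parametrization of $S_t$.

Let $\gamma(t)$ be the curve in $W$ defined by $\gamma(t) = \alpha(u_0, v_0, t)$, with $u_0, v_0$ fixed. Then:

Lemma 2 For any $u_0, v_0$, the curve $\gamma(t)$ is a geodesic in $W$.

Proof. We prove this lemma by showing that $\gamma_{tt}$ is perpendicular to $\tau^u$, $\tau^v$
and $*(N \wedge \tau^u \wedge \tau^v)$, $N$ being the normal to $W$. The only remaining possibility
will then be that $\gamma_{tt}$ is collinear to $N$, which means that $\gamma$ is a geodesic.

First note that $\gamma_t = *(N \wedge \tau^u \wedge \tau^v)$, so

$$\|\gamma_t\|^2 = \| *(N \wedge \tau^u \wedge \tau^v) \|^2 = \begin{vmatrix} \langle N, N \rangle & \langle N, \tau^u \rangle & \langle N, \tau^v \rangle \\ \langle \tau^u, N \rangle & \langle \tau^u, \tau^u \rangle & \langle \tau^u, \tau^v \rangle \\ \langle \tau^v, N \rangle & \langle \tau^v, \tau^u \rangle & \langle \tau^v, \tau^v \rangle \end{vmatrix} = 1 - \langle N, \tau^u \rangle^2 - \langle N, \tau^v \rangle^2 = 1$$

since the parametrization is orthogonal and $N$ is perpendicular to $\tau^u$ and $\tau^v$. This proves:

$$\left\langle \frac{d^2\gamma}{dt^2}, \frac{d\gamma}{dt} \right\rangle = \left\langle \frac{\partial}{\partial t}\big( *(N \wedge \tau^u \wedge \tau^v) \big),\ *(N \wedge \tau^u \wedge \tau^v) \right\rangle = 0. \tag{5}$$

Hence, $\gamma_{tt}$ is perpendicular to $*(N \wedge \tau^u \wedge \tau^v)$.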

Now, using methods similar to the one used in the 2D case (see proof of Lemma
1 in El), one can show that

$$\langle \gamma_{tt}, \tau^u \rangle = 0, \tag{6}$$

and similarly

$$\langle \gamma_{tt}, \tau^v \rangle = 0. \tag{7}$$

The curve $\gamma(t)$ is therefore a geodesic in $W$.

Lemma 3 The evolution of the family of surfaces $S_t$ is given by the equation

$$\frac{\partial \alpha}{\partial t} = *(N \wedge \tau^u \wedge \tau^v) \tag{8}$$

The proof of this lemma is not given here, as it is a simple adaptation, using
the general Gauss Lemma, of the proof given in (Lemma 2).

The two preceding lemmas demonstrate that, using the Hodge * operator, it is
possible to derive a geodesic distance evolution scheme for a family of surfaces
described by local orthogonal parametrizations. In the next subsection, we
compute the normal speed of the projection of $S_t$ onto the $(x, y, z)$ hyperplane in
$\mathbb{R}^4$.


2.3 Normal Evolution of the Projection of $S_t$ onto the $(x, y, z)$
Hyperplane

To generalize the 2D case, we now make the assumption that $W$ is a graph
hypersurface, that is to say $W = \{(x, y, z, w(x, y, z))\}$ for a function $w : \mathbb{R}^3 \to \mathbb{R}$.
Let $\pi : \mathbb{R}^4 \to \mathbb{R}^3$ be the canonical projection onto the $(x, y, z)$ hyperplane
in $\mathbb{R}^4$ and let $S(t)$ be the projection of the image of $\alpha(u, v, t)$ (that is to say, $S_t$)
onto that hyperplane:

$$S(t) = \pi \circ \alpha.$$

We denote by $p$, $q$, $r$ the following quantities: $p = \frac{\partial w}{\partial x}$, $q = \frac{\partial w}{\partial y}$ and $r = \frac{\partial w}{\partial z}$.

Starting from a result mentioned in (3, we admit that the trace of a propagating
surface may be determined only by its normal velocity, as the other components
of the velocity influence only the local parametrization. Our goal is then to
compute the projected velocity of the evolving surface $V = \langle \pi \circ \alpha_t, n \rangle$, $n =
(n_1, n_2, n_3)$ being the normal to the projected surface $\pi \circ \alpha(u, v, t)$. We state the
following result:

Lemma 4 The projected surface $S(t)$ satisfies the normal propagation rule:

$$\frac{\partial S}{\partial t} = V\, n \tag{9}$$

with

$$V = \sqrt{n^{T} M\, n} = \sqrt{a n_1^2 + b n_2^2 + c n_3^2 - d\, n_1 n_2 - e\, n_1 n_3 - f\, n_2 n_3}, \tag{10}$$

$M$ being the matrix of a quadratic form:

$$M = \frac{1}{1 + p^2 + q^2 + r^2} \begin{pmatrix} 1 + q^2 + r^2 & -pq & -pr \\ -pq & 1 + p^2 + r^2 & -qr \\ -pr & -qr & 1 + p^2 + q^2 \end{pmatrix}$$

($a = \frac{1 + q^2 + r^2}{1 + p^2 + q^2 + r^2}$, etc.)

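Sketch of the proof.

The tangent vectors $\tau^u = \frac{\alpha_u}{\|\alpha_u\|}$ and $\tau^v = \frac{\alpha_v}{\|\alpha_v\|}$ are given by their coordinates:

$$\tau^u = \frac{(x_u, y_u, z_u, w_u)}{\sqrt{x_u^2 + y_u^2 + z_u^2 + w_u^2}} \quad \text{and} \quad \tau^v = \frac{(x_v, y_v, z_v, w_v)}{\sqrt{x_v^2 + y_v^2 + z_v^2 + w_v^2}}$$

so that one finds easily:

$$N = \frac{(-p, -q, -r, 1)}{\sqrt{1 + p^2 + q^2 + r^2}}.$$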

The normal velocity of the surface's projection onto the $(x,y,z)$ hyperplane is:

$$V = \langle \pi \circ \alpha_t, n \rangle$$

with $\pi \circ \alpha_t$ having coordinates (obtained from the Hodge operator, see
section 2.1):



1 

K ^ 



qZuWy - qWuZy - ryuWy + ry^Wu -h VvZu 
-qZuWy + qWuZy rpuWy + yuZy - ryyWu - yvZu 
pyuZv - PVvZu - qXuZv + rxuyv + qXyZu - rx^yu 



with 



K = -hp2 + g2 r'^y/xl -h 2/2 + zl\/xl y^ -i- z); 
Using the fact that the parametrization $\alpha(u, v)$ is orthogonal:

$$x_u x_v + y_u y_v + z_u z_v + w_u w_v = 0,$$

it is possible to simplify the scalar product and get

$$V = \sqrt{n^{T} M\, n} = \sqrt{a n_1^2 + b n_2^2 + c n_3^2 - d\, n_1 n_2 - e\, n_1 n_3 - f\, n_2 n_3}$$

which completes the proof.

Having found the normal evolution equation of the projected surface, we now
proceed to set up an Eulerian formulation for $S(t)$, by writing that projected
surface as a level-set $\phi^{-1}(0)$. This idea was introduced by Osher and Sethian m
for crystal growth modelling. Its major advantage is the ability to handle
topological changes and singularities while ensuring stability and accuracy. Moreover,
this level-set formulation gives us the ability to compute distance maps on the
graph of the cost function. We derive such a formulation in the next subsection.



2.4 Level-Set Formulation

Given a function $\phi : \mathbb{R}^3 \to \mathbb{R}$ such that its zero level-set tracks the projected
surface $S(t) = \phi^{-1}(0)$, we determine, in the following lemma$^2$, the equation
governing the evolution of $\phi$:

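Lemma 5 The function $\phi$ follows the following propagation equation:

$$\frac{\partial \phi}{\partial t} = \sqrt{ a \left(\frac{\partial \phi}{\partial x}\right)^2 + b \left(\frac{\partial \phi}{\partial y}\right)^2 + c \left(\frac{\partial \phi}{\partial z}\right)^2 - d\, \frac{\partial \phi}{\partial x}\frac{\partial \phi}{\partial y} - e\, \frac{\partial \phi}{\partial x}\frac{\partial \phi}{\partial z} - f\, \frac{\partial \phi}{\partial y}\frac{\partial \phi}{\partial z} }, \tag{11}$$

the coefficients $a$, $b$, $c$, $d$, $e$ and $f$ having the same values as in equation (10):

$$a = \frac{1 + q^2 + r^2}{1 + p^2 + q^2 + r^2}, \quad b = \frac{1 + p^2 + r^2}{1 + p^2 + q^2 + r^2}, \quad c = \frac{1 + p^2 + q^2}{1 + p^2 + q^2 + r^2},$$

$$d = \frac{2pq}{1 + p^2 + q^2 + r^2}, \quad e = \frac{2pr}{1 + p^2 + q^2 + r^2}, \quad f = \frac{2qr}{1 + p^2 + q^2 + r^2}.$$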
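
As a small illustration of how these coefficients behave, the sketch below computes them from the slopes $p$, $q$, $r$ and evaluates the quadratic form for a given normal. The function names are hypothetical. The check that the speed equals $\sqrt{1 - \langle g, n \rangle^2 / (1 + \|g\|^2)}$ for a unit normal, with $g = (p, q, r)$, follows from the form of the matrix $M$ rather than from a statement in the text.

```python
import numpy as np

def propagation_coefficients(p, q, r):
    """Coefficients a..f of the quadratic form, from the slopes p, q, r of w."""
    s = 1.0 + p * p + q * q + r * r
    a = (1.0 + q * q + r * r) / s
    b = (1.0 + p * p + r * r) / s
    c = (1.0 + p * p + q * q) / s
    d = 2.0 * p * q / s
    e = 2.0 * p * r / s
    f = 2.0 * q * r / s
    return a, b, c, d, e, f

def normal_speed(p, q, r, n):
    """Speed V = sqrt(a n1^2 + b n2^2 + c n3^2 - d n1 n2 - e n1 n3 - f n2 n3)."""
    a, b, c, d, e, f = propagation_coefficients(p, q, r)
    n1, n2, n3 = n
    return np.sqrt(a * n1 * n1 + b * n2 * n2 + c * n3 * n3
                   - d * n1 * n2 - e * n1 * n3 - f * n2 * n3)
```

On a flat graph ($p = q = r = 0$) the speed is 1 for any unit normal, so the scheme reduces to ordinary Euclidean distance propagation.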



Proof. To derive equation (11), one simply uses equation (9) together with the
chain rule:



dp 

Ik 



< v<^ 



95 

'Ik 



> 



(12) 



and the fact that the normal n is given by the gradient vector n = ^ . 

As mentioned in the curve evolution process described in HH such an
Eulerian formulation leads to numerical resolution schemes able to handle problems
caused by a time varying coordinate system $(u,v,t)$: curvature singularities and
topological changes jI0|. We describe, in the following subsection, the
numerical resolution method used to solve equation (11). Then we use this numerical
algorithm to build distance maps.



2.5 Numerical Resolution 

The numerical implementation is a generalization of the finite difference
approximation described in for Hamilton-Jacobi type equations. It consists in
an explicit temporal scheme where spatial derivatives are approximated by finite

$^2$ It is important to note that the function $\phi$ depends not only on $(x,y,z) \in \mathbb{R}^3$, but
also on the parameter $t$. To simplify the notations, we do not write explicitly that
dependence on $t$, but it is important to keep it in mind.

differences using the minmod function. The derivative estimates can be
bounded, and the variations of the solution can therefore be controlled. The minmod
function is defined by:

$$\operatorname{minmod}(a, b) = \begin{cases} \operatorname{sign}(a) \min(|a|, |b|) & \text{if } ab > 0 \\ 0 & \text{if } ab \le 0. \end{cases}$$



Spatial derivatives are estimated on a discrete three-dimensional grid by:

$$\phi_x(ih_x, jh_y, kh_z) = \operatorname{minmod}(D_x^+ \phi(ih_x, jh_y, kh_z),\ D_x^- \phi(ih_x, jh_y, kh_z))$$
$$\phi_y(ih_x, jh_y, kh_z) = \operatorname{minmod}(D_y^+ \phi(ih_x, jh_y, kh_z),\ D_y^- \phi(ih_x, jh_y, kh_z))$$
$$\phi_z(ih_x, jh_y, kh_z) = \operatorname{minmod}(D_z^+ \phi(ih_x, jh_y, kh_z),\ D_z^- \phi(ih_x, jh_y, kh_z))$$

with $h_x$, $h_y$ and $h_z$ being the spatial discretization steps. $D_x^-$ and $D_x^+$ (respectively
$D_y^-$, $D_y^+$ and $D_z^-$, $D_z^+$) are the left and right derivatives in the $x$ (resp. $y$
and $z$) direction. They are defined by:

$$D_x^+ \phi(ih_x, jh_y, kh_z) = \frac{\phi((i+1)h_x, jh_y, kh_z) - \phi(ih_x, jh_y, kh_z)}{h_x}$$

and

$$D_x^- \phi(ih_x, jh_y, kh_z) = \frac{\phi(ih_x, jh_y, kh_z) - \phi((i-1)h_x, jh_y, kh_z)}{h_x}.$$

To compute the squared derivatives, one uses El:

$$\phi_x^2(ih_x, jh_y, kh_z) = (\max(D_x^+ \phi(ih_x, jh_y, kh_z),\ -D_x^- \phi(ih_x, jh_y, kh_z),\ 0))^2$$
$$\phi_y^2(ih_x, jh_y, kh_z) = (\max(D_y^+ \phi(ih_x, jh_y, kh_z),\ -D_y^- \phi(ih_x, jh_y, kh_z),\ 0))^2$$
$$\phi_z^2(ih_x, jh_y, kh_z) = (\max(D_z^+ \phi(ih_x, jh_y, kh_z),\ -D_z^- \phi(ih_x, jh_y, kh_z),\ 0))^2.$$



With these spatial derivative estimations, we can now write the discrete
numerical scheme for solving equation (11). The function $\phi$ is computed at the locations
$(ih_x, jh_y, kh_z)$ of a three-dimensional grid using a recurrence sequence denoted
$\phi(ih_x, jh_y, kh_z)^n$ and defined by the relation$^3$:

$$\begin{aligned} \phi^{n+1} = \phi^n + \big[\, & a\, (\max(D_x^+ \phi^n, -D_x^- \phi^n, 0))^2 \\ & + b\, (\max(D_y^+ \phi^n, -D_y^- \phi^n, 0))^2 \\ & + c\, (\max(D_z^+ \phi^n, -D_z^- \phi^n, 0))^2 \\ & - d\, \operatorname{minmod}(D_x^+ \phi^n, D_x^- \phi^n)\, \operatorname{minmod}(D_y^+ \phi^n, D_y^- \phi^n) \\ & - e\, \operatorname{minmod}(D_x^+ \phi^n, D_x^- \phi^n)\, \operatorname{minmod}(D_z^+ \phi^n, D_z^- \phi^n) \\ & - f\, \operatorname{minmod}(D_y^+ \phi^n, D_y^- \phi^n)\, \operatorname{minmod}(D_z^+ \phi^n, D_z^- \phi^n) \,\big]^{1/2}\, \Delta t \end{aligned} \tag{13}$$

This explicit scheme is conditionally stable, and the convergence to a stationary
solution is achieved when the time step $\Delta t$ and the spatial steps $h_x$, $h_y$ and $h_z$
satisfy the Courant-Friedrichs-Lewy condition:



At < 



1 

Ta\Yi{hx,hy, hz) 



( 14 ) 
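
One explicit update of this kind can be sketched as follows. This is a minimal illustration under assumptions that the text does not state: a single grid spacing `h` for all three axes, replicated values at the grid boundary, and a radicand clamped at zero before taking the square root. The names `minmod`, `one_sided` and `explicit_step` are hypothetical; the coefficients `a..f` are passed in as scalars or arrays.

```python
import numpy as np

def minmod(a, b):
    """sign(a) * min(|a|, |b|) where a and b share a sign, 0 elsewhere."""
    return np.where(a * b > 0, np.sign(a) * np.minimum(np.abs(a), np.abs(b)), 0.0)

def one_sided(phi, axis, h):
    """Right (D+) and left (D-) differences along one axis.

    Boundary samples are replicated, which is an assumption of this sketch.
    """
    padded = np.pad(phi, 1, mode="edge")
    inner = [slice(1, -1)] * phi.ndim
    ahead = list(inner)
    ahead[axis] = slice(2, None)
    behind = list(inner)
    behind[axis] = slice(0, -2)
    d_plus = (padded[tuple(ahead)] - phi) / h
    d_minus = (phi - padded[tuple(behind)]) / h
    return d_plus, d_minus

def explicit_step(phi, coeffs, h, dt):
    """One update phi^{n+1} = phi^n + sqrt(radicand) * dt on a 3-D grid."""
    a, b, c, d, e, f = coeffs
    (xp, xm), (yp, ym), (zp, zm) = (one_sided(phi, ax, h) for ax in range(3))

    def squared(d_plus, d_minus):
        # (max(D+ phi, -D- phi, 0))^2
        return np.maximum(np.maximum(d_plus, -d_minus), 0.0) ** 2

    mx, my, mz = minmod(xp, xm), minmod(yp, ym), minmod(zp, zm)
    radicand = (a * squared(xp, xm) + b * squared(yp, ym) + c * squared(zp, zm)
                - d * mx * my - e * mx * mz - f * my * mz)
    # Clamping at zero is an assumption made here to keep the root real.
    return phi + np.sqrt(np.maximum(radicand, 0.0)) * dt
```

With the flat-graph coefficients `(1, 1, 1, 0, 0, 0)` and a function that grows linearly along the first axis with unit slope, each interior sample increases by `dt` per step.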



$^3$ To simplify the notations, $\phi^n$ stands for $\phi(ih_x, jh_y, kh_z)^n$ and similarly for

In practice the spatial resolution is given by the image and the time resolution
is determined so as to satisfy (14). Using this numerical scheme, we can now use the
computed function $\phi$ to generate distance maps on a manifold. This is explained
in the following subsection.

2.6 Distance Maps on a 3-Manifold 

In order to use equation (11) for computing the geodesic distance map of the
surface described by a parametrization $\alpha(u,v,0)$ on the 3-manifold $W$, we have
to define an initial estimate $\phi_0$ such that the initial surface is represented through
a level-set of $\phi_0$. This initial estimate can be obtained in several ways according
to the data. We use a Euclidean distance map |2] in such a way that:

$$\phi_0(x,y,z) = \begin{cases} -d(x,y,z) & \text{if } (x,y,z) \text{ is inside } \phi_0^{-1}(0) \\ 0 & \text{if } (x,y,z) \in \phi_0^{-1}(0) \\ +d(x,y,z) & \text{if } (x,y,z) \text{ is outside } \phi_0^{-1}(0) \end{cases} \tag{15}$$

Given a graph hypersurface $W$ and the initial estimate $\phi_0$ on this hypersurface,
equation (11) characterizes the distance map of the area whose boundary is defined
by $\phi_0^{-1}(0)$.
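
For a surface with a known analytic description, an initial estimate of this signed form can be written down directly. The sketch below does so for a sphere; it is only an illustrative assumption (the paper computes a Euclidean distance map from its data), and the function name and grid parameters are hypothetical.

```python
import numpy as np

def sphere_signed_distance(shape, center, radius, spacing=1.0):
    """Signed Euclidean distance to a sphere sampled on a regular 3-D grid.

    Negative inside the sphere, zero on it, positive outside.
    """
    axes = [np.arange(n) * spacing for n in shape]
    x, y, z = np.meshgrid(*axes, indexing="ij")
    dist_to_center = np.sqrt((x - center[0]) ** 2
                             + (y - center[1]) ** 2
                             + (z - center[2]) ** 2)
    return dist_to_center - radius
```

The zero level-set of the returned array is the sphere itself, so it can serve as $\phi_0$ for a spherical initial surface.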

The tools presented in this section can now be used to perform a surface matching 
process. 

3 A General Surface Matching Process 

We now use the previous theory to build a general matching process between two
arbitrary surfaces $\mathcal{S}$ and $\mathcal{D}$ in $\mathbb{R}^3$. The matching process consists in computing
the paths on a graph hypersurface such that these paths minimize a cost function.
This approach is interesting as it does not rely on a parametrization of the two
surfaces $\mathcal{S}$ and $\mathcal{D}$: they are only represented as 0-level-sets of two functions $\phi_0$
and $\psi_0$. Moreover the process can take into account large deformations between
the two surfaces and even a topological change. The two functions $\phi_0$ and $\psi_0$ are
computed from the initial data using the rule presented in (15). Then, given these
two initial functions $\phi_0$ and $\psi_0$, the numerical process presented in equation (13)
is used to generate two distance maps on a graph surface $W$:

$$D_{\mathcal{S}} = \{(x,y,z,\phi(x,y,z))\}$$

and

$$D_{\mathcal{D}} = \{(x,y,z,\psi(x,y,z))\}.$$

The hypersurface graph $W$ is chosen in order to incorporate a geometric criterion
of similarity. It is chosen such that the two surfaces $\phi_0^{-1}(0)$ and $\psi_0^{-1}(0)$ are
level-sets of $W$. At this point one may choose to incorporate only distance information,
or also curvature, in the definition of $W$. A possible choice for $W$ using only distance
is given by:

$$W = (x,y,z,w(x,y,z)) = (x,y,z,\min(|\phi_0|, |\psi_0|)) \tag{16}$$

To take into account curvature, we use a matching criterion $\rho$, function of the
Euclidean distance $d$ and of the relative difference between the mean curvatures
$\Delta\kappa = \kappa_{\mathcal{S}} - \kappa_{\mathcal{D}}$, where $\kappa_{\mathcal{S}}$ and $\kappa_{\mathcal{D}}$ are respectively the mean curvatures of $\mathcal{S}$ and
$\mathcal{D}$. The $\rho$ function is defined in such a way that the influence of mean curvature
decreases as the Euclidean distance $d$ increases:



p{An, d) = 1 — 



Ak^ 

1 + cP-And- jo 



a being a scale parameter defining the neighbourhood around the two initial
surfaces where mean curvature is taken into account. This leads to:

$$W = (x,y,z,\min(|\phi_0|\, \rho(\Delta\kappa, \phi_0),\ |\psi_0|\, \rho(\Delta\kappa, \psi_0))) \tag{17}$$

The mean curvature is easily computed from the level-set representation of the 
surfaces: 

_ 1 

(^iPyy-\-(Pzz)Px~^{^xxPPzz^Pyp{^xxPPyy^^ z ‘^^xPy Pxy ‘^p’xPzPxz ‘^PyPzPyz 

Once $D_{\mathcal{S}}$ and $D_{\mathcal{D}}$ are computed, we obtain the matching function between $\mathcal{S}$ and
$\mathcal{D}$ by determining the paths on the graph surface $W$ starting at $\mathcal{S}$ and ending
on $\mathcal{D}$ which minimize a cost function. This cost function is:

$$f(x,y,z) = \phi(x,y,z) + \psi(x,y,z) \tag{18}$$

due to the following lemma, which is a generalization of the proposition found in HH:

Lemma 6 All the minimal paths $s \to \gamma(s)$ between $\mathcal{S}$ and $\mathcal{D}$ on $W$ minimize,
for any value of parameter $s$, the sum

$$d_W(\gamma(s), \mathcal{S}) + d_W(\gamma(s), \mathcal{D}).$$

This cost function ultimately justifies the process of building distance maps on
$W$. For each point $M_{\mathcal{S}} \in \mathcal{S}$ on the first surface, we determine a path ending
at an unknown point $M_{\mathcal{D}} \in \mathcal{D}$ on the second surface and minimizing the cost
function $f(x, y, z)$. The cost $C(\gamma)$ of a path $s \to \gamma(s)$ is defined to be:

$$C(\gamma) = \int_{M_{\mathcal{S}}}^{M_{\mathcal{D}}} f(x,y,z)\,ds \tag{19}$$

Hence the minimal paths $s \to \gamma(s)$ (where $s$ is the arc-length) are determined
as the orbits of the vector field

$$-\nabla f = -(\nabla\phi + \nabla\psi)$$

and they are computed using a Runge-Kutta method. We give some examples in the
next section.
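
Tracing such an orbit can be sketched as below. The text only says that a Runge-Kutta method is used; the classical fourth-order variant, the normalization of the field (so that the step length approximates arc-length), the fixed number of steps and the names `trace_path` and `grad_f` are all assumptions made for this illustration.

```python
import numpy as np

def trace_path(grad_f, start, step=0.05, n_steps=200):
    """Follow the orbit of -grad f from `start` with classical RK4 steps.

    grad_f: callable returning the gradient of the cost function at a point
    (hypothetical; in practice it would be interpolated from the grid).
    """
    def direction(x):
        g = np.asarray(grad_f(x), dtype=float)
        norm = np.linalg.norm(g)
        # Unit direction of steepest descent; zero where the gradient vanishes.
        return -g / norm if norm > 1e-12 else np.zeros_like(g)

    path = [np.asarray(start, dtype=float)]
    for _ in range(n_steps):
        x = path[-1]
        k1 = direction(x)
        k2 = direction(x + 0.5 * step * k1)
        k3 = direction(x + 0.5 * step * k2)
        k4 = direction(x + step * k3)
        path.append(x + step * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0)
    return np.array(path)
```

In a full implementation the loop would stop once the path reaches the destination surface; here the number of steps is simply fixed.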

4 Results

We apply the matching process described in the previous section to a first
example illustrating the ability of the model to cope with a topological change and
large deformation. The initial surface $\mathcal{S}$ is given by two spheres and the
destination surface $\mathcal{D}$ is the shape of an ellipsoid containing the two spheres. In figure 1
we show the computation of the matching paths, starting from the ellipsoid. The
graph of the cost function $W$ used in this example incorporates only the
distance. This example clearly demonstrates the ability to match dissimilar surfaces
with distinct topologies.




Fig. 1. Matching paths between a surface made of two spheres and an ellipsoid. The 
paths start from the ellipsoid. 



We show, in another example displayed in Figure 2, the use of a graph function
W incorporating distance function only, as in equation El The initial surface S 
is given by a digital elevation model, and the destination surface is computed by 
applying a geophysical deformation model to S. This geophysical model is used 
to model possible deformations observed in volcanic regions. 

5 Generalization to Higher Dimensions 

We sketch in this section a work in progress, which consists in generalizing the theory
introduced in this research to higher dimensions. We first note that the Hodge
$*$ operator can be used to define a notion of “cross product” of n − 1 vectors
in $\mathbb{R}^n$. Hence we can guess a possible generalization of the geodesic distance
evolution scheme E| as 

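$$\frac{\partial\alpha}{\partial t} = *\left(N \wedge \frac{\partial\alpha}{\partial u_1} \wedge \frac{\partial\alpha}{\partial u_2} \wedge \cdots \wedge \frac{\partial\alpha}{\partial u_{n-2}}\right) \qquad (20)$$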
Fig. 2. Matching paths between a surface obtained from a digital elevation model
and a destination surface computed by a geophysical model. The graph of the cost 
function incorporates curvature information. The destination surface is represented 
with transparency. 



where $\alpha(u_1, u_2, \ldots, u_{n-2}, t)$ is an orthogonal parametrization of a family of (n − 2)-manifolds
embedded in a hypersurface W of $\mathbb{R}^n$. Equation (20) represents the
geodesic distance evolution of a family of (n − 2)-manifolds in W. (The proof is
similar to the one presented in section 3.) One can then generalize the theory
presented in the previous sections, and set up an Eulerian formulation for the
projected manifolds $\pi \circ \alpha(u_1, u_2, \ldots, u_{n-2}, t)$, where $\pi$ is the projection onto the
$(x_1, x_2, \ldots, x_{n-1})$ hyperplane in $\mathbb{R}^n$. To achieve this, the (n − 1)-manifold W is
here again written in the form of the graph of a function $w : \mathbb{R}^{n-1} \to \mathbb{R}$.

One can then introduce the quantities $p_i = \frac{\partial w}{\partial x_i}$, with $1 \le i \le n-1$, and the
projected manifolds $\pi \circ \alpha(u_1, u_2, \ldots, u_{n-2}, t)$ are written in the form $\varphi^{-1}(0)$, for
a function $\varphi : \mathbb{R}^{n-1} \times \mathbb{R} \to \mathbb{R}$. Equation (20) becomes:

$$\frac{\partial\varphi}{\partial t} = \sqrt{{}^{t}U\, M\, U}$$
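M being the matrix of a quadratic form:

$$M = \begin{pmatrix} 1 + \sum_{j \ne 1} p_j^2 & -p_1 p_2 & \cdots & -p_1 p_{n-1} \\ -p_2 p_1 & 1 + \sum_{j \ne 2} p_j^2 & \cdots & -p_2 p_{n-1} \\ \vdots & \vdots & \ddots & \vdots \\ -p_{n-1} p_1 & -p_{n-1} p_2 & \cdots & 1 + \sum_{j \ne n-1} p_j^2 \end{pmatrix}$$

and U the column vector whose components are $\frac{\partial\varphi}{\partial x_i}$:

$$U = \begin{pmatrix} \frac{\partial\varphi}{\partial x_1} \\ \frac{\partial\varphi}{\partial x_2} \\ \vdots \\ \frac{\partial\varphi}{\partial x_{n-1}} \end{pmatrix} \qquad (21)$$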
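
To make the structure of this quadratic form concrete, here is a small Python sketch that builds a matrix with diagonal entries 1 + Σ_{j≠i} p_j² and off-diagonal entries −p_i p_j, then evaluates the form on a vector U. This is equivalent to (1 + |p|²) I − p pᵀ. The function names `quadratic_form_matrix` and `quadratic_form` are hypothetical, and the entry pattern is the one read from the matrix above.

```python
def quadratic_form_matrix(p):
    """Matrix with M[i][i] = 1 + sum_{j != i} p_j^2 and M[i][j] = -p_i p_j."""
    n = len(p)
    total = sum(v * v for v in p)
    return [[(1.0 + total - p[i] * p[i]) if i == j else -p[i] * p[j]
             for j in range(n)]
            for i in range(n)]

def quadratic_form(p, U):
    """Value of the quadratic form tU M U for slopes p and gradient vector U."""
    M = quadratic_form_matrix(p)
    n = len(U)
    return sum(U[i] * M[i][j] * U[j] for i in range(n) for j in range(n))

# Example with n - 1 = 2: slopes p = (1, 2) and U along the first axis.
M = quadratic_form_matrix([1.0, 2.0])
value = quadratic_form([1.0, 2.0], [1.0, 0.0])
```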



The numerical algorithm introduced in section 2.5 generalizes in a simple way
and permits the computation of distance maps on an n-dimensional grid.
The matching process described in section 3 can then be extended to this
n-dimensional context.



6 Conclusion 

This paper presents a general formulation for the propagation of fronts in any
dimension. Special emphasis is given to the case of propagating surfaces on a
3-manifold embedded in $\mathbb{R}^4$. This theory yields a general method for computing
distance maps on manifolds. The algorithm used avoids the drawbacks
of curvature singularities and topological changes for the projected manifolds.
An application to the problem of surface matching is presented. The matching
algorithm makes use of distance maps to compute optimal paths on a cost manifold.
The optimal paths minimize a cost criterion which can incorporate various
geometrical properties such as distance and curvature. We give two examples of
cost surfaces, but the model is general enough to include various matching criteria,
each leading to a particular cost function.

Abstract. This paper presents a new geometric relation between a solid
bounded by a smooth surface and its silhouette in images formed under 
weak perspective projection. The relation has the potential to be used for 
recognizing complex 3-D objects from a single image. Objects are mod- 
eled by showing them to a camera without any knowledge of their motion. 
The main idea is to consider the dual of the 3-D surface and the family 
of dual curves of the silhouettes over all viewing directions. Occluding 
contours correspond to planar slices of the dual surface. We introduce 
an affine-invariant representation of this surface that can be constructed
from a sequence of images and allows an object to be recognized from 
arbitrary viewing directions. We illustrate the proposed object represen- 
tation scheme through synthetic examples and image contours detected 
in real images. 



1 Introduction 

Most approaches to model-based object recognition are based on establishing 
correspondences between viewpoint-independent image features and geometric 
features of object models 0IE|. For objects with smooth surfaces, few sur- 
face markings and little texture, the most reliable image feature is the object’s 
silhouette, i.e., the projection into the image of the curve, called the occlud- 
ing contour, where the cone formed by the optical rays grazes the surface HH. 
The dependence of the occluding contour on viewpoint makes the construction 
of appropriate feature correspondences difficult. Appearance-based methods do 
not rely on such correspondences, and they are suitable for recognizing objects 
bounded by smooth surfaces, but they generally require a dense sampling of the 
pose/illumination space to be effective [Ej. Methods for relating image features 
to the 3-D geometric models of curved surfaces have been developed for surfaces 
of revolution ^uni, classes of generalized cylinders and algebraic 
surfaces [I11IHIE01E2- Two limitations of these approaches are that there must
be some means to obtain the 3-D model, and more critically, that only a limited 
number of objects are well approximated by a single primitive. 

An alternative is to replace the explicit description of the entire surface of 
the object of interest by the representation of some relatively sparse set of fea- 
tures directly useful for recognition. It is possible to represent either the 3-D 
geometry of these features HD] or to derive invariants from the image coordi- 
nates of detected features m- In these two methods, objects are modeled from 
a sequence of images obtained as a camera moves over a trajectory, yet they 
can be recognized from novel viewpoints; in cm the camera motion must be 
known whereas the motion is not needed in m- The setting considered in this 
paper generalizes m which only considered a sparse set of features of the sil- 
houette (inflections, bitangents, parallel tangents), and therefore offered limited 
discriminatory power. In contrast, the proposed approach uses nearly the en- 
tire silhouette for recognition. This study builds on geometric insights about the 
occluding contour and silhouettes of smooth surfaces sen, and their use in 
determining structure from sequences of images p E El 0 urn 

The basic processing steps for each image include detecting the silhouette 
curve, computing its dual, and then computing an HD-curve (high-dimensional 
curve) which is invariant to rigid transformations. When modeling an object 
from a camera moving over a trajectory of viewpoints, these HD-curves sweep 
out a surface (an HD-surface) which could have been computed directly from 
the object’s 3-D geometry if that geometry were available. Each object is then 
represented by an HD-surface. During recognition in a single image from a novel 
viewpoint, an HD-curve can be computed, and this HD-curve should lie on the 
HD-surface of the corresponding object. The relations between these geometric entities
are shown in Figure 1, and we define these entities and discuss their
properties in the subsequent sections.



S : 2D surface   -->  S' : 2D surface (P-surface)  -->  HD-surface  -->  Quotient HD-surface
   | viewing direction v    | planar intersection
   v                        v
C : 1D curve     -->  PC : 1D curve (P-curve)      -->  HD-curve    -->  Quotient HD-curve



Fig. 1. This diagram summarizes the relation of the original curves, surfaces, pedal 
curves, pedal surfaces, HD curves and the HD surfaces. 



2 Duals of Curves and Surfaces 

We start the development by defining some standard geometric concepts about 
curves, surfaces and duals, which can be found in mn 






D. Renaudie, D. Kriegman, and J. Ponce 




Fig. 2. Pedal curve and $\delta(s)$ construction.



Consider a planar curve $\alpha : I \subset \mathbb{R} \to \mathbb{R}^2$ parameterized by its arc length s.
At each point of the curve, $T(s) = \alpha'(s)$ and $N(s)$ denote the (unit) tangent
and normal vectors of the curve.



2.1 Dual and Pedal Curves 

The dual of a point on a curve is the tangent line to the curve at that point, and
over the entire curve $\alpha$, the dual is also a curve.

Definition 1 (Dual curve).

$$\mathrm{DCurve}(\alpha) = \{(N(s), -N(s) \cdot \alpha(s)),\ s \in I\} \qquad (1)$$

Since $N(s)$ is a unit vector, a point on a dual curve lies on a cylinder. Because
visualizing dual curves (and later dual surfaces) is sometimes difficult, we
consider the pedal curve, which is embedded in $\mathbb{R}^2$ (respectively $\mathbb{R}^3$) rather than
on a cylinder.

Definition 2 (Pedal curve, P-Curve).

$$\mathrm{PedalCurve}(\alpha) = \{\delta(s),\ s \in I\} \qquad (2)$$

where $\delta : I \to \mathbb{R}^2$ is defined by

$$\delta(s) = (N(s) \cdot \alpha(s))\, N(s). \qquad (3)$$

In other words, the pedal curve is the set of points swept by the tip of the
unit normal scaled by the (signed) distance between the origin and the tangent line
at a curve point, as this point varies across the curve. Figure 2 shows how $\delta(s)$ is defined, and
Figure 3 shows an example of a closed planar curve and its pedal curve. A
disadvantage of considering pedal curves over the duals themselves is that pedal
Fig. 3. Initial curve and its corresponding Pedal curve.



curves are “ramified at the origin”, that is, a circle of tangent lines to $\alpha(s)$ passing
through the origin maps to the origin of the pedal curve.

The following useful properties of pedal curves are readily derived from the 
definitions. 

Property 1. There is a one-to-one mapping between the dual and the pedal curve. 

Property 2. An inflection of the initial curve $\alpha$ maps onto a cusp of the pedal
curve.

Property 3. A set of p points of the initial curve $\alpha$ with a common tangent line
maps onto a point of multiplicity p of the pedal curve.

For example, bitangent lines on a curve map onto crossings of the pedal 
curve. 

Property 4. A set of p points of the initial curve with the same tangent direction
maps onto collinear points of the pedal curve, and these points are collinear with 
the origin. Reciprocally, the intersection points between a line passing through 
the origin and the pedal curve correspond to parallel tangent lines of the initial 
curve. 

Consider for example the curve and its pedal in Figure 3: the cusps in the
pedal correspond to the two inflections while the crossing corresponds to the 
common horizontal tangent line at the top of the heart. 

It is important to note that the pedal curve depends on the choice of the
coordinate system in which the initial curve $\alpha$ is described. This is significant
because the curves we will consider are image contours, and a rigid image-plane
transformation of the object will lead to a geometric change of the pedal curve.
Nonetheless, Properties 1 to 4 are independent of rigid transformations, and we
will subsequently define a mapping from the pedal curve to another curve which
is invariant to rigid transformations of the original curve $\alpha$.

2.2 The Pedal Surface

The dual of a point on a surface is a representation of the tangent plane, and
the concept of pedal curve can be extended to non-singular surfaces in $\mathbb{R}^3$. Let
$\sigma : U \subset \mathbb{R}^2 \to \mathbb{R}^3$, $(s,t) \mapsto \sigma(s,t)$, be a parameterization of a surface S. We
define the 3D P-surface of S as follows.

Definition 3 (Pedal Surface, P-Surface). The pedal surface associated with
the parameterized surface $\sigma : U \to \mathbb{R}^3$ is the parameterized surface $\delta : U \to \mathbb{R}^3$
defined by $\delta(s,t) = (N(s,t) \cdot \sigma(s,t))\, N(s,t)$.

3 Occluding Contours and Silhouettes 

We consider an object bounded by a non-singular surface S with parameterization
$\sigma$, and assume that images are formed under scaled orthographic projection
(weak perspective).

Given a point $P \in \mathbb{R}^3$ and a camera whose origin is at O and whose image
plane is spanned by the orthogonal vectors i and j, the image coordinates of
the projected point are given by:



where (O, i, j) is the camera’s coordinate frame, and x is the inverse of the depth
of some reference point. The vector v = i × j is the viewing direction. For pure
orthographic projection, x is taken to be constant and without loss of generality
we shall choose x = 1.
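
As a minimal sketch of this camera model, assuming the usual scaled-orthographic form in which each image coordinate is the scale times the dot product of OP with i or j, the projection and the viewing direction can be written as follows. The function names and the parameter name `scale` (standing for the inverse reference depth called x above) are hypothetical. The helper `on_occluding_contour` applies the orthogonality condition between the surface normal and the viewing direction that defines the occluding contour.

```python
def dot(a, b):
    """Euclidean dot product of two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    """Cross product of two 3-D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def project_weak_perspective(P, O, i, j, scale=1.0):
    """Image coordinates of P for a camera frame (O, i, j); scale = 1 gives
    pure orthographic projection (assumed standard form)."""
    OP = tuple(p - o for p, o in zip(P, O))
    return (scale * dot(OP, i), scale * dot(OP, j))

def on_occluding_contour(normal, v, tol=1e-9):
    """True when the surface normal is orthogonal to the viewing direction v."""
    return abs(dot(normal, v)) < tol

# Example: camera axes aligned with the world x and y axes.
i_axis, j_axis = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
v = cross(i_axis, j_axis)  # viewing direction v = i x j
uv = project_weak_perspective((1.0, 2.0, 5.0), (0.0, 0.0, 0.0), i_axis, j_axis, scale=0.5)
```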

Let v be a viewing direction. The occluding contour is the set of points
of $\mathbb{R}^3$ that lie on the surface S and whose normal vectors are orthogonal to the
viewing direction. The occluding contour is generally a nonplanar curve on S,
and obviously depends upon the viewpoint HH. The silhouette is the projection
of the occluding contour into the image. Note that this definition treats the
objects as being translucent and does not account for self-occlusion of sections
of the occluding contour by other portions of the object.

4 Links between P-Curves and P-Surfaces 

We now consider the relation between pedal curves of the silhouettes for some 
viewing direction and the pedal surface. We denote by v the viewing direction, 
and by OccCont the corresponding occluding contour. 

Property 5. The occluding contour maps onto a curve given by the intersection of
the pedal surface with the plane $v^\perp$ passing through the origin and perpendicular
to the viewing direction v.




(4) 
This property leads us to consider the links between the P-curve of a silhouette
and some planar slice of the P-surface. Let us consider our surface S and
its corresponding P-surface S' . Now choose some viewing direction v; it defines 
an occluding contour OccCont and a silhouette Sil under pure orthographic 
projection. Let A' denote the projection of the origin of the object’s coordinate 
system used to define S' . We have the following theorem: 

Theorem 1. The P-curve of the silhouette Sil with respect to the point A' has
the same shape as the intersection (planar slice) of the P-surface S' with the
plane $v^\perp$.

Proof. Property 5 says that the image of the occluding contour on the P-surface is a plane
curve, entirely embedded in the plane $v^\perp$. Since the silhouette is obtained
by pure orthographic projection of the occluding contour onto the image plane,
the tangent planes to S at the points of the contour are projected onto the
tangent lines of the silhouette. Thus computing the P-curve of the silhouette
with respect to A' is exactly equivalent to taking the planar intersection of the
P-surface with the plane $v^\perp$. □
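
The theorem can be checked numerically on a simple case. The sketch below, given under stated assumptions, uses a hypothetical ellipsoid with semi-axes a, b, c centred at the origin and the viewing direction v = (0, 0, 1). The occluding contour is then the curve z = 0 on the ellipsoid, the silhouette is the ellipse with semi-axes a and b, and A' is the image origin. The helper names `pedal_point` and `slice_and_pedal` are illustrative only.

```python
import math

def pedal_point(point, normal):
    """(N . X) N for a point X and a normal direction (normalized here)."""
    norm = math.sqrt(sum(c * c for c in normal))
    n = [c / norm for c in normal]
    h = sum(u * w for u, w in zip(n, point))
    return [h * c for c in n]

# Hypothetical ellipsoid x^2/a^2 + y^2/b^2 + z^2/c^2 = 1.
a, b, c = 3.0, 2.0, 1.0

def slice_and_pedal(t):
    """P-surface point on the occluding contour and P-curve point of the silhouette."""
    # Occluding-contour point for v = (0, 0, 1): the surface normal has no z part.
    X = (a * math.cos(t), b * math.sin(t), 0.0)
    N3 = (X[0] / a ** 2, X[1] / b ** 2, X[2] / c ** 2)
    surface_pt = pedal_point(X, N3)
    # Silhouette point (orthographic projection along z) and its 2-D normal.
    x2 = X[:2]
    N2 = (x2[0] / a ** 2, x2[1] / b ** 2)
    curve_pt = pedal_point(x2, N2)
    return surface_pt, curve_pt

surface_pt, curve_pt = slice_and_pedal(0.7)
```

The P-surface point lies in the plane z = 0, which is the plane through the origin perpendicular to v, and its in-plane coordinates coincide with the P-curve point of the silhouette.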

This property leads to the following critical insight. Consider a camera mov- 
ing over a trajectory during modeling such that the viewing direction v(t) is 
known. Then from the previous property, one could reconstruct a subset of the 
object’s P-surface from the P-curves if A'{t) can be determined. If for exam- 
ple, v(t) covers a great circle, then the entire P-surface would be reconstructed. 
Furthermore, for some other viewing direction v not in v(t), the corresponding 
P-curve would simply be a planar slice of the reconstructed P-surface. 

4.1 A Representation of the Pedal Curve and the Pedal Surface 

The previous properties could form the basis for a recognition system in which
smooth surfaces are modeled from a sequence of images with known camera motion,
and objects are then recognized from an arbitrary viewing direction
v. However, this requires establishing correspondence of A'(t) during modeling
and detecting the projection of the 3-D origin A during recognition; we know
of no properties for doing this. Instead, we now define a representation (HD-curves
and HD-surfaces) that is invariant to the choice of the origin (and in
fact to affine image transformations) and furthermore eliminates the need to
know the camera motion v(t) during modeling. In particular, we take advantage
of the previously mentioned invariance of Properties 1 to 4 under rigid plane
transformations.

Definition 4 (Signature). Given a planar curve $\mathcal{PC}$ parameterized by $\delta : I \to \mathbb{R}^2$
and a point C (called the scanning center), we call the signature of the curve
$\mathcal{PC}$ with respect to the point C the couple $((\theta_1, \ldots, \theta_p), (d_1, \ldots, d_p))$ such that:

(1) $\sum_{k=1}^{p-1} (\theta_{k+1} - \theta_k) = 2\pi$

(2) $\forall k \in 1 \cdots p-1,\ \forall \theta \in [\theta_k, \theta_{k+1}],\ \#(\mathcal{D}(C, \theta) \cap \mathcal{PC}) = d_k + 1$

where $\mathcal{D}(C, \theta)$ denotes the straight line passing through C with orientation $\theta$,
and $\#(E)$ denotes the number of elements of the set E.

The signature partitions the plane into angular sectors centered at C and
bounded by $\theta_k$, $\theta_{k+1}$ such that the number of intersections between any line
passing through C and the curve $\mathcal{PC}$ is constant within each sector. The critical
angles where the number of intersections changes are given by the singular points
of the pedal curve (see Properties 2 and 3) and by the points where the tangent to
the pedal curve passes through C.

We now define the HD-curve of a planar curve $\mathcal{PC}$ with respect to a point C
as the multi-dimensional set of curves $(HDComp_1, \ldots, HDComp_p)$ given by:

Definition 5. $\forall k \in 1 \cdots p$, $HDComp_k$ is a curve embedded in $\mathbb{R}^{d_k}$ with coordinates
$(y_1 = x_2 - x_1, \ldots, y_{d_k} = x_{d_k+1} - x_1)$, with $(x_1, \ldots, x_{d_k+1})$ being the
abscissas of the intersection points between the oriented line $\mathcal{D}(C, \theta)$ and $\mathcal{PC}$.

In other words, in each angular sector k defined previously, we take the $d_k + 1$
intersection points between $\mathcal{PC}$ and lines passing through C, order these intersection
points by their abscissas along the line, and compute the $d_k$ differences from the
first (thus the smallest) abscissa.

We now consider an important property of HD-curves that makes them
useful for modeling and recognition. Consider an initial silhouette curve C,
parameterized by $\alpha : I \to \mathbb{R}^2$. Now compute its corresponding P-curve $\mathcal{PC}$,
parameterized by $\delta : I \to \mathbb{R}^2$, constructed relative to the origin of the image plane.

Property 6. The HD-curve of the P-curve $\mathcal{PC}$ of C is invariant with respect to the choice of
the origin of C and to any rigid transformation of C.

The key point is that the P-curve gathers all parallel tangency
and inflection information in an easily exploitable form, but its shape
depends upon the choice of the origin. By choosing the HD-curve scanning center
to be the same point as the P-curve origin, this dependence on the choice
of the origin is suppressed. Here is another perspective: consider the silhouette
curve C; its tangency features (inflection points, parallel tangents, ...) are intrinsic
properties of the curve. For example, consider a set of four points on a curve
with parallel tangents; they will be mapped to four collinear points of the P-curve.
Now, consider the relative distances between these four aligned points.
They correspond to the relative distances between tangent lines to the initial
curve, so it is not surprising that we succeed in eliminating the dependency
with respect to the P-curve center.

In a similar manner, one can define an HD-surface derived from the P-surface.
Unlike the P-surface, whose shape depends upon the choice of the origin, the
HD-surface is invariant to rigid transformations of the original surface. Based
on Property 5 and its implications, as well as the invariance of HD-curves
to rigid transformations, the entire HD-surface can be determined from
the sequence of silhouettes formed when the viewing direction covers a great



Duals, Invariants, and the Recognition of Smooth Objects 791 







Fig. 4. Contour detected by thresholding and the computed pedal curve.



circle. Hence, the HD-surface can serve as a representation for recognition, and 
it can be constructed without knowledge of the camera motion. 

Finally, under weak perspective, the scale x(t) is a function of time. In constructing
the HD-curve components, we can treat the $d_k$ coordinates as homogeneous
coordinates or normalize the coordinates as $(z_1 = y_1/y_{d_k}, \ldots, z_{d_k-1} = y_{d_k-1}/y_{d_k})$,
with $(y_1, \ldots, y_{d_k})$ being the coordinates of points on $HDComp_k$ of
the HD-curve. This creates a curve embedded in $\mathbb{R}^{d_k-1}$ rather than in $\mathbb{R}^{d_k}$, and
we call this the Quotient HD-curve. The Quotient HD-curve of $\mathcal{PC}$ is invariant
with respect to any homothetic transformation of the initial silhouette curve C
in the image plane. Over a sequence of images, the family of Quotient HD-curves
sweeps out a Quotient HD-surface.
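
The normalization can be sketched in a few lines of Python. The function name `quotient_hd_point` is hypothetical; it takes one point of an HD-curve component, that is, the list of differences from the first abscissa, and divides every coordinate except the last by the last one.

```python
def quotient_hd_point(y):
    """Map (y_1, ..., y_d) to (y_1 / y_d, ..., y_{d-1} / y_d)."""
    last = y[-1]
    return [value / last for value in y[:-1]]

# A point of a 3-D HD-curve component and the same point after a uniform
# scaling of the image by a factor 3: both give the same quotient.
point = [1.0, 2.0, 4.0]
scaled = [3.0 * value for value in point]
q_point = quotient_hd_point(point)
q_scaled = quotient_hd_point(scaled)
```

Because a homothety multiplies all abscissa differences by the same factor, the ratios are unchanged, which is the invariance stated above.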

5 Implementation 

Here, we present some details of our Matlab implementation for computing the 
geometric entities described previously. 

5.1 Image Processing 

For each image of the sequence, we detect the silhouette of the object using 
thresholding to obtain a connected curve, i.e. a discrete list of all successive 
points on the contour. 

The discrete contour extracted by thresholding introduces artifacts due to
aliasing, leading to poor results when used in the subsequent steps of the algorithm.
Consequently, we recursively smooth the contour using Gaussian filters
parameterized by arc length m- First, the arc length to each point on the contour
is computed. For each point $\alpha(t_0)$ on the original contour, we obtain new
x and y coordinates according to:



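$$\forall t_0 \in 0 \ldots t_{\max}, \quad \text{Smoothed-}\alpha(t_0) = \frac{\int_{t_0-4\sigma}^{t_0+4\sigma} G(t_0 - t)\, \alpha(t)\, dt}{\int_{t_0-4\sigma}^{t_0+4\sigma} G(t_0 - t)\, dt} \qquad (5)$$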


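where G is the Gaussian kernel of standard deviation $\sigma$:

$$G(t) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-t^2/(2\sigma^2)} \qquad (6)$$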
Fig. 5. A contour after recursively applying the Gaussian filter with $\sigma$ = 2.0 five times
and the computed pedal curve.



Since the curve is discrete, $\alpha(t)_{t \in 0 \ldots t_{\max}}$, we approximate the above integrals
using the trapezoid method with irregularly spaced interpolation points.
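
The smoothing step can be sketched as follows in Python (the original implementation is in Matlab and is not reproduced here). The sketch assumes an open polyline, a Gaussian weight of standard deviation sigma, a window of ±4 sigma in arc length, and the trapezoid rule on the irregularly spaced samples. The function names `arc_lengths` and `smooth_contour` are hypothetical.

```python
import math

def arc_lengths(points):
    """Cumulative arc length t_k at each vertex of a polyline."""
    t = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        t.append(t[-1] + math.hypot(x1 - x0, y1 - y0))
    return t

def smooth_contour(points, sigma):
    """One pass of arc-length Gaussian smoothing with trapezoid integration."""
    t = arc_lengths(points)
    smoothed = []
    for k0, t0 in enumerate(t):
        # Samples falling inside the window [t0 - 4 sigma, t0 + 4 sigma].
        idx = [k for k, tk in enumerate(t) if abs(tk - t0) <= 4.0 * sigma]
        num_x = num_y = den = 0.0
        for ka, kb in zip(idx, idx[1:]):
            ga = math.exp(-(t0 - t[ka]) ** 2 / (2.0 * sigma ** 2))
            gb = math.exp(-(t0 - t[kb]) ** 2 / (2.0 * sigma ** 2))
            dt = t[kb] - t[ka]
            # Trapezoid rule on each interval for numerator and denominator.
            num_x += 0.5 * (ga * points[ka][0] + gb * points[kb][0]) * dt
            num_y += 0.5 * (ga * points[ka][1] + gb * points[kb][1]) * dt
            den += 0.5 * (ga + gb) * dt
        # Isolated sample (no neighbour in the window): keep it unchanged.
        smoothed.append((num_x / den, num_y / den) if den > 0.0 else points[k0])
    return smoothed

# Hypothetical aliased contour: a zigzag of amplitude 0.5 around y = 3.
zigzag = [(float(k), 3.0 + 0.5 * (-1) ** k) for k in range(41)]
smooth = smooth_contour(zigzag, 2.0)
```

The constant normalization factor of the Gaussian cancels between numerator and denominator, so it is omitted in the weights. Applying `smooth_contour` repeatedly corresponds to the recursive smoothing described above.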

Example 1. Figure 4 shows the initial contour and its corresponding P-curve,
and Fig. 5 shows the results after five consecutive smoothings.

5.2 Pedal Curve Computation 

Given the discrete form of the possibly smoothed curve, we compute the normal
vector for each point on the contour. For this, we use linear least squares to estimate
the direction of the tangent. Then Equation (3) can be directly applied to
compute the pedal curve; this phase is very quick once the normals are available.
The only difference with Eq. (3) is that the contour is parameterized by a discrete
parameter instead of a continuous one.
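
A minimal Python sketch of this computation is given below. For brevity it estimates the tangent by a central difference between the two neighbouring samples, which is a simplification swapped in for the least-squares fit mentioned above. The function name `pedal_curve` is hypothetical and the contour is assumed to be closed.

```python
import math

def pedal_curve(contour):
    """Pedal points delta_k = (N_k . alpha_k) N_k of a closed discrete contour,
    taken with respect to the coordinate origin."""
    n = len(contour)
    pedal = []
    for k in range(n):
        x, y = contour[k]
        xp, yp = contour[(k + 1) % n]
        xm, ym = contour[k - 1]
        tx, ty = xp - xm, yp - ym         # central-difference tangent (assumption)
        norm = math.hypot(tx, ty)
        nx, ny = ty / norm, -tx / norm    # unit normal, tangent rotated by -90 degrees
        h = nx * x + ny * y               # signed distance from origin to tangent line
        pedal.append((h * nx, h * ny))
    return pedal

# Hypothetical test contour: a circle of radius r centred at (c, 0).
r, c, n_samples = 2.0, 1.0, 90
circle = [(c + r * math.cos(2.0 * math.pi * k / n_samples),
           r * math.sin(2.0 * math.pi * k / n_samples))
          for k in range(n_samples)]
pedal = pedal_curve(circle)
```

For this circle the pedal curve is the limaçon (r + c cos θ)(cos θ, sin θ), which gives a simple way to check the result. The sign of the normal does not matter, since it appears twice in the product.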

5.3 HD-Curve Computation 

The method for computing the HD-Curve is as follows:

1. An oriented line constitutes the reference line of the angles. For reasons
described in Section 4.1, we choose as scanning center the same point that
was used for computing the P-Curve, and in practice, the origin of the angles
is the horizontal, right-oriented line passing through the scanning center.
Changing the origin of the angles only cycles the storage of the points of the
HD-Curve, but it does not affect it otherwise.

2. We regularly sample $[0, 2\pi]$ and obtain the angles $\{\theta_i\}_{i \in 1 \ldots N}$.

3. For each of the sampled angles $\theta_i$:

- Consider the line passing through the scanning center and making an
angle $\theta_i$ with the reference line, counted anti-clockwise.

- Calculate the signed abscissas (along this line) of the intersection
points of this line and the P-Curve, and sort them.

- Calculate the distances between these intersection points, i.e. the differences
between the previously calculated abscissas and the first abscissa
(the smallest).

- Store these distances.
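
The steps above can be sketched in Python as follows. The sketch assumes that the P-Curve is available as a closed polyline (a list of 2-D points) and intersects each scanning line with every segment of that polyline. The names `cross2`, `line_abscissas` and `hd_curve` are hypothetical, as is the parallelism tolerance.

```python
import math

def cross2(a, b):
    """z component of the cross product of two 2-D vectors."""
    return a[0] * b[1] - a[1] * b[0]

def line_abscissas(curve, center, theta):
    """Sorted signed abscissas of the intersections between a closed polyline
    and the oriented line through `center` at angle `theta`."""
    d = (math.cos(theta), math.sin(theta))
    abscissas = []
    n = len(curve)
    for k in range(n):
        a, b = curve[k], curve[(k + 1) % n]
        e = (b[0] - a[0], b[1] - a[1])
        w = (a[0] - center[0], a[1] - center[1])
        denom = cross2(d, e)
        if abs(denom) < 1e-12:        # segment parallel to the scanning line
            continue
        u = cross2(w, d) / denom      # position of the hit along the segment
        if 0.0 <= u < 1.0:            # half-open, so shared vertices count once
            abscissas.append(cross2(w, e) / denom)
    return sorted(abscissas)

def hd_curve(curve, center, n_angles):
    """For each sampled angle, the differences between the sorted abscissas
    and the first (smallest) one."""
    samples = []
    for i in range(n_angles):
        theta = 2.0 * math.pi * i / n_angles
        ts = line_abscissas(curve, center, theta)
        samples.append((theta, [t - ts[0] for t in ts[1:]]))
    return samples

# Hypothetical stand-in for a P-Curve: a square scanned from its centre.
square = [(1.0, -1.0), (1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0)]
samples = hd_curve(square, (0.0, 0.0), 4)
```

Grouping consecutive angles that have the same number of stored distances gives the angular sectors of the signature, and each group is one component of the HD-Curve.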
Fig. 6. A heart, a squash and a banana.





Fig. 7. A single image of the squash and its corresponding HD-Curve and Quotient 
HD-Curve. 



6 Results 

We demonstrate these algorithms with three series of images: a heart, a butternut
squash and a banana (Fig. 6), each rotating about a vertical axis. While the heart
images are synthetic, the squash and banana images were
acquired with a camera from real fruits and vegetables.

6.1 An Example of HD-Curve and Quotient HD-Curve 

We consider a single image of the squash, and the HD-Curve and Quotient HD-Curve
that have been extracted from this image. More precisely, Fig. 7 shows
only the 3D part of the HD-Curve and the 2D part of the Quotient HD-Curve,
Fig. 8. Images of the heart.

since they are more easily representable than higher-dimensional curves. Note
that in this particular case, the HD-Curve has only a 1D and a 3D component,
but the banana, which is a more “complex” object, has a 5D component that
we cannot readily draw.

Note that for any translation and/or rotation of the initial image, the HD-Curve
(and thus also the Quotient HD-Curve) remains unchanged, and any zoom or
change of image resolution leaves the Quotient HD-Curve unchanged.

6.2 HD-Surfaces 

Below we show the result of computing the HD-surfaces for the heart, squash 
and banana. Please note that we have drawn sample points on the HD-surfaces 
rather than a rendering of an interpolated surface. 



The Heart Sequence. Figure 8 shows a series of five artificially generated
images (115x115) of the heart, rotating about a vertical axis, and Figure 9 shows
the corresponding computed HD-Surface (3D part only) and the Quotient HD-Surface
(2D part only).



The Squash Sequence. Figure 10 shows a series of five real images (300x300)
of a real squash, and Figure 11 shows the corresponding HD-surfaces.



The Banana Sequence. Finally, Figure 12 shows a series of four images of a
banana, and Figure 13 shows the corresponding HD-surfaces. Please note that
the angle of rotation over the four images is significant (over 120 degrees), and
so four images yield a very sparse sampling of the HD-surfaces.

7 Conclusion 

In this paper, we have introduced a new relation between the dual of a smooth 
surface and the dual of the silhouettes formed under orthographic projection. 
While the dual (or pedal) curve/surface depends upon the choice of the origin, 
the HD-curves (HD-surfaces) are invariant to rigid transformations. We have 
[Figure panels: HD-Curves for a set of images of the same object, rotating about a vertical axis (115x115); five image views]




Fig. 9. Sample points on HD-Surface and Quotient HD-surface computed from five 
images of the heart. 



illustrated how these geometric entities can be computed from synthetic and real 
images, and have discussed how they could be used for object recognition. Unlike 
nearly all methods for recognizing smooth curved 3-D objects, this technique 
does not assume that objects come from a limited class such as surfaces of 
revolution, generalized cylinders or algebraic surfaces. 

Note that we have not yet incorporated this representation within an object 
recognition system. An important issue to be addressed will be the combinatorics 
of determining whether a measured HD-curve lies on a model HD-surface when 
there is occlusion during either the modeling or recognition phase. 

The basis for this method is that the set of points on an object’s surface 
with parallel tangent planes project under orthographic projection to image 
curve points with parallel tangent lines. Between a test image and each model 
image, the HD-curves and HD-surfaces are essentially being used to identify 
candidate stereo frontier points [Zl ■ This paper generalizes our earlier results m 
which defined invariants from silhouette tangent lines that were parallel to the 
tangent lines at inflections or bitangents. It turns out that these features are the 
Fig. 10. Five images of the squash.







Fig. 11. Squash’s HD-Surface (3D part) and Quotient HD-Surface (2D part). 



singularities (cusps and crossings) of the dual of the silhouette. In [23], this led
to representing an object by a set of “invariant curves” while in this paper, an 
object is represented by a set of “invariant surfaces,” namely the HD-surfaces. 
Consequently, the presented representation retains much more information about 
the object’s shape and should provide greater discriminatory power than the 
curves used in m- 

Abstract. In principle, the recovery and reconstruction of a 3D object
from its 2D view projections require the parameterisation of its shape 
structure and surface reflectance properties. Explicit representation and 
recovery of such 3D information is notoriously difficult to achieve. Al- 
ternatively, a linear combination of 2D views can be used which requires 
the establishment of dense correspondence between views. This, in general,
is difficult to compute and necessarily expensive. In this paper we
examine the use of affine and local feature-based transformations in es- 
tablishing correspondences between very large pose variations. In doing 
so, we utilise a generic-view template, a generic 3D surface model and 
Kernel PCA for modelling shape and texture nonlinearities across views. 
The abilities of both approaches to reconstruct and recover faces from 
any 2D image are evaluated and compared. 



1 Introduction 

In principle, the recovery and reconstruction of a 3D object from any of its 2D 
view projections requires the parameterisation of its shape structure and surface 
reflectance properties. In practice, explicit representation and recovery of such 
3D information is notoriously difficult to achieve. A number of shape-from-X 
algorithms proposed in the computer vision literature can only be applied on 
Lambertian surfaces that are illuminated through a single collimated light source 
and with no self-shadowing effects. Atick et al. [1] have applied such a shape- 
from-shading algorithm to the reconstruction of 3D face surfaces from single 2D 
images. In real-life environments, however, these assumptions are unlikely to be 
realistic. 

An alternative approach is to represent the 3D structure of objects, such
as faces, implicitly without resorting to explicit 3D models at all [3, 14, 15, 17].
Such a representation essentially consists of multiple 2D views together with
dense correspondence maps between these views. In this case, the 2D image 
coordinates of a point on a face at an arbitrary pose can be represented as a
linear combination of the coordinates of the corresponding point in a set of 2D 
images of the face at different poses provided that its shape remains rigid. These 
different views span the space of all possible views of the shape and form a vector 
space. The shape of the face can then be represented by selecting sufficient local 
feature points on the face. Such representation requires the establishment of 
dense correspondence between the shape and texture at different views. These 
are commonly established by computing optical flow [3, 19]. In general, a dense 
correspondence map is difficult to compute and necessarily expensive. Besides, an 
optical flow field can only be established if the neighbouring views are sufficiently 
similar [3]. 

One can avoid the need for dense correspondence by considering a range of
possible 2D representation schemes utilising different degrees of sparse corre- 
spondence. In the simplest case, transformations such as translation, rotation 
and uniform scaling in the image plane can be applied to a face image to bring 
it into correspondence with another face image. Such transformations treat im- 
ages as holistic templates and do not in general bring all points on the face 
images into accurate correspondence. This transformation results in a simple 
template-based representation that is based only on the pixel intensity values of 
the aligned view images and does not take into account the shape information 
explicitly. Such representation, for example, was used by Turk and Pentland to 
model Eigenfaces [16]. 

Alternatively, a local feature-based approach can be used to establish
correspondences only between a small set of salient feature points.
Correspondences between other image points are then approximated by interpolating
between salient feature points, such as corners of the eyes, nose and mouth. In
Active Appearance Models (AAM), Cootes et al. bring two views into alignment
by solving the correspondence problem for a selected set of landmark points [4].
The face texture is then aligned using a triangulation technique for 2D warp- 
ing. In AAM, however, correspondences can only be established between faces 
of similar views. 

Ultimately, modelling view-invariant appearance models of 3D objects, such
as faces, across all views relies on recovering the correspondence between local
features and the texture variation across views. This inevitably encounters
problems due to self occlusion, the non-linear variation of the feature positions 
and illumination change with pose. In particular, point-wise dense correspon- 
dence is both expensive and may not be possible across large view changes since 
rotations in depth result in self occlusions and can prohibit complete sets of 
image correspondence from being established. However, the template-based im- 
age representation such as [6, 16] did not address the problem of large 3D pose 
variations of a face. Recognition from certain views is facilitated using piece- 
wise linear models in multiple view-based eigenspaces [9]. Similarly Cootes et 
al. [4] do not address the problem of non-linear variation across views and aimed 
only at establishing feature-based correspondences between faces of very similar
views. In this case, small degrees of non-linear variations can also be modelled
using linear piecewise mixture models [5].

Romdhani et al. [10, 11] have shown that a View Context-based Nonlinear
Active Shape Model utilising Kernel Principal Components Analysis [13, 12]
can locate faces and model shape variations across the view-sphere from profile
to profile views. This approach is extended here in a Face Appearance Model
of both Shape and Texture across views. We introduce two different methods for
establishing correspondences between views. The first method uses an affine
transformation to register any view of a face with a generic view shape template. An
alternative feature-based approach is examined that utilises a generic 3D sur- 
face model. We present the two approaches and examine the ability of the two 
correspondence methods to reconstruct and recover face information from any 
2D view image. 

In Section 2 of this paper, we introduce a generic-view shape template model 
and a generic 3D surface model to be used for establishing feature-based cor- 
respondences across poses. A Pose Invariant Active Appearance Model using 
Kernel PCA is discussed in Section 3. In Section 4, we present experimental 
results and comparative evaluations before we conclude in Section 5. 

2 Feature Alignment Across Very Large Pose Variations 

Accurately modelling the texture of an object requires the corresponding fea- 
tures to be aligned. However, achieving this geometric normalisation across views 
under large pose variation is nontrivial. Cootes et al. [4] and Beymer [2] both 
align the features of face images on the mean shape of a fixed pose. While this
technique is valid when dealing with faces at the same or very similar pose, it is
clearly invalid for faces which vary from profile to profile views as illustrated in
Fig. 1. 




Fig. 1. Left: Examples of training shapes. Right: The average shape, and the average 
shape overlapping the frontal view. 



A new correspondence and alignment method is required that must address the 
following issues: 

1. Due to self occlusion, some features visible at one pose are hidden at another
pose (e.g. the left eye is hidden from the left profile). This problem can be
addressed by two possible methods: (a) Hidden features can be made explicit
to the model without regenerating their texture by utilising a generic-view
shape template and establishing affine correspondence between views. (b)
A generic 3D face surface model can be utilised to establish feature-based
correspondence. The hidden features are regenerated using the information
of the visible features, based on the bilateral symmetry of faces.

2. Pose change is caused by head rotation out of the image plane. This means 
that the features’ positions vary nonlinearly with the pose and a feature 
alignment algorithm must be able to cope with nonlinear deformations. We 
use Kernel PCA to model nonlinear deformations of both the shape and the
texture of a face across pose. 

In this paper we discuss two correspondence and alignment methods and 
evaluate their ability to recover and reconstruct faces from any 2D image. The 
first alignment technique establishes affine correspondences between views by 
utilising a generic-view shape template. The second approach uses a local feature- 
based approach to establish correspondences between views by utilising a generic 
3D surface model. 

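First let us define some notation. A shape $X$ is composed of a set of $N_s$
landmark points $\mathbf{x}_i$ and the texture $\mathbf{v}$ of a set of $N_t$ grey-level values $v_i$:

$$\mathbf{x}_i = (x_i, y_i)^T, \quad X = (\mathbf{x}_1, \ldots, \mathbf{x}_{N_s})^T, \quad \mathbf{v} = (v_1, \ldots, v_{N_t}) \tag{1}$$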

The shape X of any single view is composed of two types of landmark points: 

(a) $X_{out}$, the outer landmark points which define the contour of the face, and

(b) $X_{in}$, the inner landmark points which define the position of the features such
as mouth, nose, eyes and eyebrows.

In particular, 25 outer landmark points define the contour of the face and 55 
inner landmark points define the position of the features such as mouth, nose, 
eyes and eyebrows. The landmarks that correspond to salient points on the faces 
are placed manually on the training images whereas the remaining landmarks are 
evenly distributed between them. This is illustrated in Fig. 2. First, landmarks 
A, B, C, D and E that are selected to correspond to points on the contour at the 
height of the eyes, the lips and the chin-tip were set. Then the remaining outer 
landmarks are distributed evenly between these 5 points. Note that as the view 
changes the positions of the outer landmarks change accordingly. 




Fig. 2. Shapes overlapping faces at pose −30°, 0° and 30°. The salient outer landmarks
(in white) are first manually set then the other outer landmarks (in black) are
distributed evenly between the salient ones.

2.1 2D Generic-View Shape Template based Alignment



The 2D Generic-View Shape Template Alignment method uses affine transformations
to establish correspondences across very large pose variations. It utilises
a generic-view shape template, denoted by Z, on which the landmark points of 
each view are aligned. The generic-view shape template Z is computed based on 
M training shapes and the following alignment process: 

1. The training shapes $X$ of each view are scaled and aligned to yield shape $\tilde{X}$:

$$\tilde{X} = \frac{X - \mathbf{x}_k}{\lVert \mathbf{x}_k - \mathbf{x}_l \rVert} \tag{2}$$

where $k$ refers to the landmark located on the nose-tip and $l$ to the chin-tip.

2. These aligned shapes are superimposed. 

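3. The resulting generic-view template shape is formed by the mean of the inner
landmark points and the extreme outer landmark points:

$$Z_{in} = \frac{1}{M} \sum_{i=1}^{M} \tilde{X}_{in,i} \tag{3}$$

$$\mathbf{z}_{out,j} = \tilde{\mathbf{x}}_{i,j} \quad \text{if } \tilde{\mathbf{x}}_{i,j} \notin Z_{out}, \quad \forall\, i = 1, \ldots, M, \; j = 1, \ldots, N_s \tag{4}$$

$$Z = \mathcal{H}(Z_{out}, Z_{in}) \tag{5}$$

where $\tilde{\mathbf{x}}_{i,j} \notin Z_{out}$ is true if the point $\tilde{\mathbf{x}}_{i,j}$ is not included in the area of the
shape $Z_{out}$ and $\mathcal{H}(\cdot)$ is the operator which concatenates an outer shape and
an inner shape, yielding a complete shape.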
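As an illustration of steps 1 to 3, the following minimal sketch normalises each training shape by its nose-tip and nose-to-chin distance (Eq. 2) and averages the aligned landmarks point-wise (Eq. 3). The function names, the list-of-tuples shape layout and the toy coordinates are assumptions made for this example; the selection of the extreme outer landmarks (Eq. 4) is deliberately left out because it needs a point-in-shape test that the text does not specify.

```python
import math

def normalise_shape(shape, nose_idx, chin_idx):
    """Translate a shape so the nose-tip is the origin and scale it by the
    nose-tip to chin-tip distance (Eq. 2). `shape` is a list of (x, y)."""
    kx, ky = shape[nose_idx]
    lx, ly = shape[chin_idx]
    scale = math.hypot(kx - lx, ky - ly)
    if scale == 0.0:
        raise ValueError("nose-tip and chin-tip coincide")
    return [((x - kx) / scale, (y - ky) / scale) for x, y in shape]

def mean_shape(shapes):
    """Point-wise mean of M shapes that share the same landmark ordering (Eq. 3)."""
    m = len(shapes)
    n = len(shapes[0])
    return [(sum(s[j][0] for s in shapes) / m,
             sum(s[j][1] for s in shapes) / m) for j in range(n)]

# Toy data (hypothetical): landmark 0 is the nose-tip, 1 the chin-tip,
# 2 a single inner landmark.
training = [
    [(0.0, 0.0), (0.0, 2.0), (1.0, 1.0)],
    [(5.0, 5.0), (5.0, 9.0), (9.0, 7.0)],
]
aligned = [normalise_shape(s, nose_idx=0, chin_idx=1) for s in training]
z_in = mean_shape(aligned)
print(z_in)
```

After normalisation every shape has its nose-tip at the origin and a unit nose-to-chin distance, which is why the alignment is exact only for the nose-tip.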

The process for creating the generic-view shape template is illustrated in 
Fig. 3. 




Fig. 3. Left: Left profile, frontal view and right profile shapes aligned with respect to 
the nose-tip. Right: A generic-view shape template which includes a set of inner feature
points as illustrated. 



To align the shape and the texture to the generic-view shape template, a fast
affine transformation is applied. Examples of aligned textures at different poses
are shown in Fig. 4.
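The text does not say how the affine transformation is estimated. One common choice, shown here purely as an assumption, is a least-squares fit between corresponding landmarks. The names `fit_affine` and `apply_affine` are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2D affine map taking src landmarks (N, 2) to dst (N, 2).
    Returns a 2x3 matrix A whose last column is the translation."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    homog = np.hstack([src, np.ones((len(src), 1))])      # (N, 3)
    params, *_ = np.linalg.lstsq(homog, dst, rcond=None)  # (3, 2)
    return params.T                                       # (2, 3)

def apply_affine(A, pts):
    """Apply a 2x3 affine matrix to an (N, 2) array of points."""
    pts = np.asarray(pts, dtype=float)
    return pts @ A[:, :2].T + A[:, 2]

# Toy check: recover a known transformation from five landmark pairs.
A_true = np.array([[1.2, -0.3, 4.0],
                   [0.5, 0.9, -2.0]])
view_landmarks = np.array([[0, 0], [1, 0], [0, 1], [2, 3], [-1, 4]], dtype=float)
template_landmarks = apply_affine(A_true, view_landmarks)
A_est = fit_affine(view_landmarks, template_landmarks)
print(np.round(A_est, 6))
```

With real landmarks the fit is only approximate, since a single affine map cannot absorb the out-of-plane rotation between a profile view and the template.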

To utilise the generic-view shape template, all feature points including the
hidden features are made explicit to the model at all times: a special
grey-level value (0, or black) is used to denote hidden points.

Fig. 4. Example of aligned textures at different poses.



In addition, the initial alignment performed is coarse and is exact only for 
the nose-tip. The other features are only approximately aligned as illustrated in 
Fig. 5. The z axis of this 3D graph is proportional to the distance covered by
a landmark point on the aligned shape as the pose varies from profile to profile.
Ideally this distance should be zero for all landmark points, as it is for the nose-tip.
Once the initial bootstrapping of the texture alignment is performed, Kernel
PCA is applied to minimise the error between the aligned shape X and the generic-view
shape template Z.




Fig. 5. Variance of the error for inner features alignment across pose. The z axis represents
the distance covered by a landmark point as the pose varies from profile to profile
relative to the face width.



2.2 3D Generic Surface Model based Alignment 

We introduce a second feature alignment technique based on a generic 3D surface 
model shown in Fig. 6. It is composed of facets and vertices and constructed using 
the average of the 3D surface of training faces. A feature-based approach is used 
to establish correspondence between the 3D model and the 2D image views of 
a face. Landmarks on the 3D model are placed in the same manner as on
the face images described earlier. In total 64 facets are selected to correspond to
the 64 landmarks registered on the images (the eyebrows’ landmarks were not 
used for the alignment). The inner landmark points are placed in facets that 
correspond to features such as eyes or nose, whereas the outer landmark points 
are placed on facets that correspond to the extreme outer boundaries of the face 
model. Examples of the outer landmark points placed on the 3D model are shown
in Fig. 6. A property of the outer landmark points of the generic 3D surface
model is that their position can vary to outline the outer boundaries of any 2D
projected view of the 3D model.

Fig. 6. Examples of landmarked facets on the generic 3D surface model.



The feature-based alignment algorithm used to establish the correspondences 
between the 3D surface model and a 2D face image is outlined in Fig. 7, and 
consists of the following steps: 



1. 3D Model Rotation: First the generic 3D surface model is rotated to 
reflect the same pose as that of the face in the image. 

2. 3D Landmarks Recovery: The positions of the landmarks on the 3D
generic model relative to the rotated pose are examined in order to determine:
(a) which inner landmark points are visible at this pose and therefore
can be used for alignment and, (b) the new position of the outer landmark 
points so that the current visible outer boundary of the generic 3D model is 
outlined. This process ensures that the landmarks on the generic 3D model 
correspond to the face image landmarks at that pose. 

3. 2D Projection of the Generic 3D Model: Once the new position of the 
landmark points of the 3D model has been established, the 2D projection of 
the generic 3D model at that pose is computed. 

4. 2D Texture Warping: A triangulation algorithm is used to warp the face 
image on the 2D projection of the 3D model using the landmarks recovered 
at step 2. 

5. Hidden Points Recovery: The grey level values of the hidden points are 
recovered using the bilateral symmetry of faces. 

6. Aligned Texture: Our aligned texture is a flattened representation of the 
3D texture. 
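Steps 1, 2, 3 and 5 can be sketched on a toy model as follows. Several simplifications here are assumptions rather than details from the text: the head rotates about the vertical axis only, the projection is orthographic, a landmark counts as visible when its rotated facet normal points towards the camera (occlusion by other facets is ignored), and the `mirror` table pairing each landmark with its bilateral counterpart is hypothetical.

```python
import math

def rotate_y(p, pose_deg):
    """Rotate a 3D point or normal about the vertical (y) axis by pose_deg degrees."""
    a = math.radians(pose_deg)
    x, y, z = p
    return (x * math.cos(a) + z * math.sin(a), y, -x * math.sin(a) + z * math.cos(a))

def project_visible(landmarks, normals, pose_deg):
    """Steps 1-3: rotate the model, keep landmarks whose facet normal faces the
    camera (positive z after rotation), and project them orthographically."""
    projected = {}
    for name, point in landmarks.items():
        if rotate_y(normals[name], pose_deg)[2] > 0.0:
            x, y, _ = rotate_y(point, pose_deg)
            projected[name] = (x, y)
    return projected

def fill_hidden_by_symmetry(texture, mirror):
    """Step 5: copy the grey level of the mirrored landmark into hidden ones.
    Hidden values are marked with None."""
    return {name: (grey if grey is not None else texture.get(mirror[name]))
            for name, grey in texture.items()}

# Toy model with three landmarked facets (coordinates are made up).
landmarks = {"left_cheek": (-1.0, 0.0, 0.5), "right_cheek": (1.0, 0.0, 0.5), "nose": (0.0, 0.0, 1.0)}
normals = {"left_cheek": (-1.0, 0.0, 0.2), "right_cheek": (1.0, 0.0, 0.2), "nose": (0.0, 0.0, 1.0)}
mirror = {"left_cheek": "right_cheek", "right_cheek": "left_cheek", "nose": "nose"}

visible = project_visible(landmarks, normals, pose_deg=60.0)
texture = {name: (100 if name in visible else None) for name in landmarks}
print(sorted(visible), fill_hidden_by_symmetry(texture, mirror))
```

The texture warping of step 4 is omitted; it would need a triangulation of the visible landmarks and a piecewise-affine warp of the image onto the projected model.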



Examples of alignment and reconstruction using the generic 3D surface model 
are shown in Fig. 8. The textures of the visible region and of the
hidden region often contrast because of the lighting conditions. It can be noted
that the aligned profile view of an individual is different from the aligned frontal 
view of the same individual. A more accurate alignment can be obtained using 
a 3D model containing more facets and higher resolution images.

Fig. 7. Overview of the algorithm for aligning the texture of a face based on its shape
and its pose using a landmarked 3D model. After rotation of the 3D model, its land- 
marks are adjusted: the hidden inner points (3 landmarks on the bridge of the nose, 
here) are dropped and the outer landmarks are moved to be visible. Then a 2D warping 
is performed from the image to the 2D projection of the 3D model. Next, the grey-level 
values of the hidden points are recovered using the symmetry of faces and the texture 
is projected onto 2D yielding a flattened representation of the texture.

Fig. 8. Example of alignment of face images at different poses using the 3D Generic
Surface Model and their reconstruction. 



3 Pose Invariant Appearance Model using Kernel PCA 

Our process for constructing a pose invariant shape and texture model is illustrated
in Fig. 9. The shape is represented as a vector containing the $x_i$ and $y_i$
coordinates of the $N_s$ landmarks augmented with the pose angle $\theta$ as in the View
Context-based Nonlinear Active Shape Model [10]: $(x_1, y_1, \ldots, x_{N_s}, y_{N_s}, \theta)$. Cootes
et al. [4] used a linear PCA to model the shape and texture of faces. However,
under very large pose variations the shape and texture vary nonlinearly despite 
our texture alignment. 

Kernel Principal Components Analysis (KPCA) [13] is a nonlinear PCA
method, based on the concept of Support Vector Machines (SVM) [18]. Kernel
PCA can also be regarded as an effective nonlinear dimensionality reduction
technique which retains the benefits of PCA. Unlike mixture models, KPCA does
not require more training vectors than normal PCA. However,
KPCA has one major drawback: the reconstruction of a vector from the
KPCA space to the original space requires solving an optimisation problem, which
is computationally expensive [8].
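To make the projection step concrete, here is a minimal Kernel PCA sketch. The Gaussian (RBF) kernel, the value of `gamma` and the class layout are assumptions chosen for illustration, since the text does not state which kernel is used. Only the forward projection is shown; the costly reconstruction back to the original space is not attempted.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

class KernelPCA:
    def __init__(self, n_components, gamma):
        self.n_components = n_components
        self.gamma = gamma

    def fit(self, X):
        self.X = np.asarray(X, dtype=float)
        n = len(self.X)
        K = rbf_kernel(self.X, self.X, self.gamma)
        one = np.full((n, n), 1.0 / n)
        Kc = K - one @ K - K @ one + one @ K @ one  # centre in feature space
        vals, vecs = np.linalg.eigh(Kc)             # ascending eigenvalues
        order = np.argsort(vals)[::-1][: self.n_components]
        self.eigenvalues = vals[order]              # assumed strictly positive
        # Scale coefficients so each feature-space eigenvector has unit norm.
        self.alphas = vecs[:, order] / np.sqrt(self.eigenvalues)
        self.k_col_mean = K.mean(axis=0)
        self.k_mean = K.mean()
        return self

    def transform(self, Y):
        """Project new vectors onto the retained nonlinear components."""
        Ky = rbf_kernel(np.asarray(Y, dtype=float), self.X, self.gamma)
        Kyc = Ky - Ky.mean(axis=1, keepdims=True) - self.k_col_mean[None, :] + self.k_mean
        return Kyc @ self.alphas

rng = np.random.default_rng(0)
shapes = rng.normal(size=(20, 3))  # stand-in for training shape vectors
model = KernelPCA(n_components=2, gamma=0.5).fit(shapes)
print(model.transform(shapes[:3]))
```

Note that the number of eigenvectors is bounded by the number of training vectors, not by the input dimension, which is consistent with the remark that KPCA needs no more training data than linear PCA.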

Romdhani et al. [10, 11] successfully used KPCA to model shape variations
from profile to profile views. KPCA is also used to model the aligned
texture. However, the combined shape and texture model is built with a linear 
PCA. This is because our experiments verified that the correlation between the 
shape and the texture is linear after KPCA has been applied to both shape and 
texture individually. 

As explained in Section 2, the model must be constructed using manually 
landmarked training images. The projection of a landmarked new face image to 
our model can be computed in a single step by computing (1) the projection of 
its shape (defined by its landmarks), (2) the projection of the underlying texture 
and (3) the projection of the combined shape and texture. However, in the most
general case a new face image does not possess landmarks. Hence a fitting algorithm
is used which recovers the shape and computes the projection of a novel
face. This is achieved by iteratively minimising the difference between the image
under interpretation and that synthesised by the model. Instead of attempting
to solve such a general optimisation problem for each fitting, the similar nature
of the optimisations required for each fitting is exploited. Hence, directions
of fast convergence, learned off-line, are used to rapidly compute the
solution. This results in a linear relationship between the image space error
and the model space error. Before this linear relationship is learned by an SVD
regression, a linear PCA is performed to reduce the dimensionality of the image
space error and ease the regression. The iterative fitting algorithm described in
the following is similar to that used by Cootes et al. [4]:

1. Assume initial shape and pose. In the next Section we will detail the con- 
straints set on the starting shape and pose for the algorithm to converge. 

2. Compute a first estimation of the projection using the shape and pose from 
step 1 and the texture of the image underlying the current shape. 

3. Reconstruct the shape (along with its pose) and the aligned texture from 
the current projection. 

4. Compute the image space error between the reconstructed aligned texture 
obtained in step 3 and the aligned texture of the image underlying the re- 
constructed shape obtained in step 3. 

5. Estimate the projection error using the image space error computed in step 4 
along with the known linear correlation between the image space error and 
the model space error computed off-line. This projection error is then applied 
to the current projection. 

6. Go back to step 3 until the reconstructed texture does not change significantly.
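The loop above can be sketched on a toy problem as follows. Everything specific in this example is an assumption: the synthesiser is a made-up linear map from model parameters to texture, the image texture is held fixed instead of being resampled under the current shape at each iteration, and the PCA reduction of the image-space error before the regression is omitted. The function names are hypothetical.

```python
import numpy as np

def learn_update_matrix(synthesise, p0, n_samples, scale, rng):
    """Off-line stage: perturb the parameters, record the texture differences,
    and regress parameter error on texture error (least squares via SVD)."""
    dP = rng.normal(scale=scale, size=(n_samples, p0.size))
    g0 = synthesise(p0)
    dG = np.array([synthesise(p0 + dp) - g0 for dp in dP])
    Rt, *_ = np.linalg.lstsq(dG, dP, rcond=None)
    return Rt.T  # maps a texture residual to a parameter correction

def fit(image_texture, synthesise, R, p_init, tol=1e-8, max_iter=20):
    """On-line stage: steps 3 to 6 of the fitting loop."""
    p = p_init.copy()
    for _ in range(max_iter):
        residual = synthesise(p) - image_texture  # image space error
        if np.linalg.norm(residual) < tol:
            break
        p = p - R @ residual                      # apply the predicted correction
    return p

rng = np.random.default_rng(0)
W = rng.normal(size=(50, 3))    # toy model: 3 parameters, 50 texture samples
mean_texture = rng.normal(size=50)

def synthesise(p):
    return W @ p + mean_texture

R = learn_update_matrix(synthesise, np.zeros(3), n_samples=200, scale=0.5, rng=rng)
p_true = np.array([0.5, -1.0, 2.0])
p_fit = fit(synthesise(p_true), synthesise, R, p_init=np.zeros(3))
print(np.round(p_fit, 6))
```

Because the toy model is exactly linear, the learned update recovers the parameters in a single step. With a real appearance model the relationship holds only locally, which is why several iterations and a reasonable starting shape are needed.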






Fig. 9. An algorithm for constructing a Pose Invariant AAM. The projection and back-
projection to and from the model are outlined in plain line. The generation of model
parameters for a novel image for which the shape is unknown (the model fitting process)
is outlined in dashed line.

4 Experiments

To examine and compare the ability of the two approaches to reconstruct and 
recover faces from any 2D image views we use a face database composed of 
images of six individuals taken at pose angles ranging from −90° to +90° at 10°
increments. During acquisition of the faces, the pose was tracked by a magnetic 
sensor attached to the subject’s head and a camera calibrated relative to the 
transmitter [7]. The landmark points on the training faces were manually located. 
In the case of the generic 3D surface model we used a 3D surface model provided 
by Michael Burton of the University of Glasgow. We trained three Pose Invariant 
AAM (PIAAM) on the images of faces of six individuals at 19 poses. The first 
PIAAM used a 2D generic-view shape template, the second a generic 3D surface 
model containing 3333 facets and the third a generic 3D surface model containing 
13328 facets. In the three cases ten, forty, and twenty eigenvectors were retained
to describe the shape, the texture and the combined appearance respectively. 

4.1 Face Reconstruction 

Fig. 10 shows examples of reconstruction of the three PIAAM when the shape
and pose of the faces are known. The PIAAM using a generic 3D surface model
containing 3333 facets exhibits a “pixelised” effect. The accuracy of the recon- 
struction of the PIAAM using a 2D generic-view shape template and of the 
PIAAM using a generic 3D surface model containing 3333 facets is similar while 
that of the PIAAM using a generic 3D surface model containing 13328 facets is 
superior. However, the experiments of the next section show that 3333 facets is 
sufficient to produce a good fitting. 

4.2 Face Recovery Using the Pose Invariant AAM 

2D Generic-view Shape Template. Fig. 11 illustrates examples of recovering
the shape and texture of any 2D image view using a 2D generic-view shape 
template-based PIAAM trained on five individuals. 

While the shape (both pose and feature points) can be recovered adequately, 
this is not the case for texture. Whilst the pose of the face can be recovered 
correctly, the intensity information for all pixels is not always recovered. The
reason for this effect is that the alignment is only approximate and the variation
in the aligned texture due to the pose change overwhelms the variation due to 
identity difference. 

Generic 3D Surface Model. Fig. 12 shows the recovery of faces from varying
poses using a generic 3D surface model-based Pose Invariant AAM. The model
contained 3333 facets. The linear PCA used in the fitting regression was config- 
ured to retain 99% of information yielding 626 eigenvectors. The iterative fitting 
always starts from a frontal pose shape located near the face on the image and
can recover the shape and texture of any 2D image face view. Each iteration
takes about 1 sec. (on a normal Pentium II 333 MHz) and the convergence is
reached after an average of 4 iterations.

Fig. 10. Example of reconstruction produced by three Pose Invariant AAM. The first
image is the original image, the second, the third and the fourth images are its recon- 
struction yielded by the Pose Invariant AAM using the 2D generic-view shape tem- 
plate, the 3D generic surface model containing 3333 facets and the 3D generic surface 
model containing 13328 facets, respectively. The reconstructions are computed using 
the manually generated shape. 



4.3 On Model Convergence 

The AAM introduced by Cootes et al. requires a good starting shape to reach
convergence [4]. That is, an estimate of the position of the face and of its pose
must be known. Fig. 13 depicts the dependency on this requirement for the Pose
Invariant AAM using the 2D generic-view shape template and the generic 3D
surface model by showing the proportion of searches which converged for different
initial displacements and pose offsets. The 2D generic-view shape template-based
PIAAM is very constrained by its initial pose and location: if the pose is known
within 10° accuracy, it has an 80% chance of reaching convergence if the x offset is
within 4 pixels. However, the generic 3D surface model-based PIAAM has an 80%
chance of reaching convergence if the pose offset is within 50° and the x offset is
within 4 pixels (note that the faces have an average width of 30 pixels in x). This is
because the 3D surface model alignment is more accurate than the 2D generic-view
shape template alignment. As expected, the better the pose is known, the
lower the dependency on the estimation of the face location.

pose: -90°, init. pose: -50°, init. offset: (0, 0), shape error: (2.09, 0.76)
pose: -50°, init. pose: 0°, init. offset: (-4, -3), shape error: (3.67, 0.82)
pose: -90°, init. pose: 0°, init. offset: (-6, 0), shape error: (2.43, 0.97)
pose: 90°, init. pose: 50°, init. offset: (-6, 0), shape error: (0.87, 0.65)
pose: -40°, init. pose: 0°, init. offset: (0, -6), shape error: (6.62, 2.84)

Fig. 11. Face recovery of a 2D generic-view shape template-based Pose Invariant AAM
trained on five individuals. Each row is an example of texture and shape fitting. The first
image is the original image, the following images are obtained at successive iterations.
The penultimate image shows the converged fitting of both shape and texture and the
last image overlaps the recovered shape on the original image.

5 Conclusions

We have presented a novel approach for constructing a Pose Invariant Active
Appearance Model (PIAAM) able to capture both the shape and the texture of
faces across large pose variations from profile to profile views. We illustrated why
the key to effective Pose Invariant AAM is the choice of an accurate but also
computationally viable alignment model and its corresponding texture representation.
To that end, we introduced and examined quantitatively two alignment
techniques for the task: (a) A 2D Generic-view Shape Template using affine
transformations to bootstrap the alignment before it is further refined by the
use of Kernel PCA. (b) A Generic 3D Surface Feature Model using projected
dense 3D facets to both establish local feature-based correspondence between facial
points across pose and recover the grey level values of those points which are
hidden at any given view. Our extensive experiments have shown that whilst the
reconstruction accuracy of the 2D generic-view template-based PIAAM is similar
to that of a generic 3D surface feature-based PIAAM using 3333 facets, the
reconstruction performance of a generic 3D feature-based PIAAM using 13328
facets is superior. Furthermore, good fitting was produced using a PIAAM based on a
generic 3D surface model containing 3333 facets. On the other hand, the fitting
of a 2D generic-view shape template-based PIAAM was shown to have a greater
degree of dependency on the initial positions before fitting.

Fig. 12. Face recovery of a PIAAM using the generic 3D surface model containing 3333
facets trained on six individuals. Each row is an example of texture and shape fitting.
The first image is the original image, the following images are obtained at successive
iterations until convergence. Each fitting started from the frontal pose (CP).

Abstract. The problem of establishing correspondences between images
taken from different viewpoints is fundamental in computer vision.
We propose an algorithm which is capable of handling larger changes
in viewpoint than classical correlation based techniques. Optimal performance
for the algorithm is achieved for textured objects which are locally
planar in at least one direction. The algorithm works by computing affinely
invariant Fourier features from intensity profiles in each image. The
intensity profiles are extracted from the image data between randomly
selected pairs of image interest points. Using a voting scheme, pairs of
interest points are matched across images by comparing vectors of Fourier
features. Outliers among the matches are rejected in two stages: a
fast stage using novel view consistency constraints, and a second, slower
stage using RANSAC and fundamental matrix computation. In order to
demonstrate the quality of the results, the algorithm is tested on several
different image pairs.



1 Introduction 

The problem of matching points between images taken from different viewpoints 
is fundamental in computer vision. 3D reconstruction, motion recovery, object 
recognition and visual servoing are some of the central problems which benefit 
from or demand point correspondences across images. 

The application which has motivated the approach presented in this paper 
is visual servoing for a gripper mounted on an autonomous robot. Such a robot 
should be capable of using its arm and eye-in-hand camera to find, recognise and 
pick up a large set of objects. 

In general, objects can only be stably grasped in a limited number of ways, 
depending on the shape of the object. Given that the robot knows how to grasp 
an object starting from a discrete set of reference positions relative to the object, 
all that is needed is to position the gripper in one of the reference positions. 

The key to positioning the gripper in the reference position is the epipolar 
geometry; if it is known, along with the internal camera parameters, the direction 
of translation and the rotation needed to bring the camera and the gripper to the 
reference position can be recovered. This well-known fact has been exploited by
several authors recently. However, to compute the epipolar geometry,
one needs point correspondences.

In this paper, we present a new method of acquiring these. The method is of
course not limited to visual servoing but can be used wherever point correspon- 
dences are needed. 

2 Related Work 

Recently, the wide baseline matching problem has received a lot of attention, 
due to the fact that classical correlation based matching algorithms, although 
successful 13 for small baselines, fail if the change in viewpoint is too large. 
However, having a wide baseline is desirable in many cases: in reconstruction,
a wide baseline gives a better-conditioned problem than a short baseline, and in
view-based object recognition, fewer stored reference views of each object are
required.

Most approaches aiming to solve the wide baseline matching problem rely on 
matching interest points between images, the reason obviously being that point 
correspondences are needed to estimate for instance epipolar geometry. Interest 
points are often extracted using the Harris “corner” detector, and it will be used 
here as well. Specifically, we used the Harris implementation in the TargetJR 
software distribution. 

The manner in which the interest points are matched across images is what 
separates the different algorithms. The subsequent steps are often quite simi- 
lar; after having established an initial set of correspondences, most authors use 
robust techniques such as LMedS (Least Median of Squares) |S| or RANSAC 
(RANdom SAmple Consensus) 0 to eliminate mismatches and estimate the 
geometry robustly. 

One example of this approach can be found in the work by Pritchett and 
Zisserman jZj. They present two different methods. In the first one, they match 
quadrangular structures between two images. Given those seed matches, and 
assuming that interest points in the vicinity of each seed match are on a locally 
planar surface, they compute local homographies and find the interest point cor- 
respondences which are consistent with these homographies. By iterating, they 
can find point correspondences further and further away from the initial seed 
match. With the other method, they first try to approximately register the two 
images by finding the global similarity transformation which maximizes cross-correlation
between the images at a coarse scale. Starting from the similarity
transformation, they try to estimate local affine transformations at finer scales 
in smaller regions. Each local transformation is then used to find interest point 
correspondences. 

Another large class of matching methods relies on computing a vector of 
features for each interest point. The features should capture the local image 
information in a compact way, while being invariant to transformations of the 
camera and changes of illumination. Interest points are then matched by comparing
feature vectors.

One example of this approach, although focused on object recognition, is
the work by Schmid and Mohr 0. They compute x- and y-derivatives of the 
image intensity up to the third order at each interest point and combine these 
in order to form features which are invariant to translations and rotations of
the image plane. To achieve scale invariance, the features have to be computed at
several scales using Gaussian blurring. The drawback of this approach is that the
features are not invariant to more general transformations, for instance affine 
transformations. The consequence is that many reference images are needed to 
achieve good performance. 

Tuytelaars and van Gool present an algorithm |3] which uses affinely invariant 
moments in small regions close to each interest point. The moments are based 
on all three colour bands of the images. They show how a large number of low- 
order independent moments can be computed for each interest point to form 
an invariant feature vector. The crucial part of this work is to find the same 
region for each interest point, no matter how the camera was moved or the 
illumination was changed. This is done by searching for the region which gives 
a local maximum in one or more of the feature values. The approach can handle
more general transformations than the approach by Schmid and Mohr, at the
cost of a more time-consuming algorithm.

3 Point Pair Algorithm 

The algorithm proposed here also uses affine invariants, but eliminates the se- 
arch for regions over which to compute invariant values. It introduces a higher 
computational complexity, since we look at pairs of points, but that is partly 
reduced by the fact that invariants are only computed for the image information 
on a line between two interest points, instead of on a surface patch. 

In short the algorithm consists of 7 steps. Assuming that the reference image 
has already been processed, and a new query image is to be matched to the 
reference image, these are the main steps of the algorithm: 

1. Extract interest points from the query image using the Harris
corner detector. Form pairs of interest points.

2. For each pair of interest points, extract the image intensity profile
for the line going between the two points.

3. For each intensity profile, compute scale-invariant Fourier coeffi-
cients to form feature vectors.

4. For each feature vector in the query image, find all the feature 
vectors in the reference image which are similar. In a voting table, 
place one vote for the match of the two startpoints of the lines, 
and one vote for the match of the endpoints. 

5. Extract maxima from the voting table. This gives a set of candi- 
date point correspondences. 

6. Reject outliers using view consistency constraints. 

7. Using RANSAC, eliminate more outliers and find the epipolar 
geometry. 



We believe that the major contribution here is the use of features computed
from intensity profiles (see Fig. 1) and the use of view consistency constraints
to reject outliers.

The important parts of the steps mentioned above will now be described and 
discussed in greater detail. 



3.1 Selecting Point Pairs 

The Harris corner detector has been widely used for extracting points of interest. 
It has been shown to outperform 0 many other interest point detectors. Also, 
its common use makes it easier to compare our results to those of other authors. 

When forming pairs of interest points, it’s important not to choose all possi- 
ble pairs, since the computation time is linear in the number of pairs. It seems 
reasonable to assume that the greater the distance between two interest points, 
the less likely it is for the line between them to be on a planar surface. There- 
fore, we set a threshold on the maximal allowed distance between two points. 
Interest points shouldn’t be too close either, since the information content of 
the line between the points might be too small for a reliable computation of the 
features. Setting these thresholds is not trivial; we simply used what seemed like
reasonable values. We worked on 768x576 images and set the minimum and
maximum distances to 20 and 200 pixels, respectively.
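
The pair selection described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `select_point_pairs`, the inclusive thresholds and the brute-force loop over all pairs are hypothetical choices, not taken from the authors' implementation.

```python
import math
from itertools import combinations

def select_point_pairs(points, min_dist=20.0, max_dist=200.0):
    """Return index pairs (i, j), i < j, whose distance lies in [min_dist, max_dist].

    The 20 and 200 pixel defaults are the values quoted in the text;
    treating both bounds as inclusive is an assumption.
    """
    pairs = []
    for (i, p), (j, q) in combinations(enumerate(points), 2):
        d = math.hypot(p[0] - q[0], p[1] - q[1])
        if min_dist <= d <= max_dist:
            pairs.append((i, j))
    return pairs

# Hypothetical corner positions in a 768x576 image.
corners = [(100, 100), (110, 100), (160, 180), (700, 500)]
print(select_point_pairs(corners))
```

Here the pair (0, 1) is dropped because the points are only 10 pixels apart, and every pair involving the last corner is dropped because it is more than 200 pixels from the others.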



3.2 Feature Extraction 

For each extracted intensity profile, a number of Fourier coefficients are computed.
Specifically, six coefficients are used: the dot product of the profile with the
first three sine and cosine basis vectors. To make the features invariant to scale
changes, they are normalized by dividing by the profile length.

[Figure panels: Image 1, Image 2; intensity profile from image 1, intensity profile from image 2]

Fig. 1. The same intensity profile extracted from two different images. Under certain
circumstances, the profiles are related by a simple scale change.

Formally, given an image intensity profile $p(i)$ of length $N$, the following sine
coefficients for $m \in \{1, 2, 3\}$ are computed:

$$f_m^{\sin} = \frac{1}{N} \sum_{i=0}^{N-1} p(i) \sin(2\pi m i / N) \qquad (1)$$

The case for the cosine function is equivalent. These features are invariant
to affine transformations of the planar surface on which the line lies (see Fig. 2).
Intensity normalization of each profile prior to computing the features provides
invariance to affine illumination changes, i.e. offset and scaling of light.
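
A minimal sketch of the feature computation may make the two invariances concrete. It assumes that "intensity normalization" means subtracting the mean and dividing by the standard deviation (the text does not specify this), and the names `normalize_intensity`, `fourier_features` and `sample_profile` are hypothetical.

```python
import math

def normalize_intensity(profile):
    """Remove offset and gain (zero mean, unit standard deviation); an assumed normalization."""
    n = len(profile)
    mean = sum(profile) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in profile) / n)
    if std == 0.0:
        return [0.0] * n
    return [(v - mean) / std for v in profile]

def fourier_features(profile, orders=(1, 2, 3)):
    """Three sine and three cosine coefficients, each divided by the profile length."""
    p = normalize_intensity(profile)
    n = len(p)
    sines = [sum(p[i] * math.sin(2 * math.pi * m * i / n) for i in range(n)) / n
             for m in orders]
    cosines = [sum(p[i] * math.cos(2 * math.pi * m * i / n) for i in range(n)) / n
               for m in orders]
    return sines + cosines

def sample_profile(n, gain=1.0, offset=0.0):
    """Sample one synthetic intensity pattern with n pixels along the line."""
    return [offset + gain * (3 + 2 * math.sin(2 * math.pi * i / n)
                             + math.cos(4 * math.pi * i / n))
            for i in range(n)]

# The same pattern seen at two scales and under different gain and offset.
short = fourier_features(sample_profile(50))
long_ = fourier_features(sample_profile(120, gain=5.0, offset=10.0))
print(short)
print(long_)
```

For this synthetic pattern the two feature vectors agree up to rounding error, even though one profile is 50 samples long and the other 120 with a different gain and offset. Real profiles are resampled from images, so the agreement there is only approximate.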

One could of course argue that the best features to use are those obtained
through the K-L transform, but, as is well known, the Fourier coefficients
can sometimes offer a close approximation to the K-L coefficients. In our case,
we did not see any consistent differences in the end results, i.e. in the accuracy
of the estimated geometry or the number of correct matches, when we compared
the K-L coefficients to the Fourier features. Since the Fourier features give a much
faster algorithm (fixed basis vectors, profiles needn’t be rescaled), we decided to
use them.





Fig. 2. Fourier coefficients are computed from intensity profiles between interest
points. Here, three sine basis vectors are shown (amplitude adjusted to fit well in
the image). The normalized feature values for the left image profile are (0.56, -0.26,
1.52) and for the right (0.57, -0.12, 1.47)



The features we use possess an interesting property which somewhat relaxes 
the constraint that the transformation between the images must be affine. Since 
we are only looking at the image information on a line, it suffices if the transfor- 
mation is locally approximated by a 1-d affine transformation in the direction of 
the line. In many images of planar surfaces, there are severe projective distorti- 
ons, but sometimes predominantly in one direction and less in another. Since all 
possible pairs of points are formed, the algorithm can make use of the profiles 
where the local 1-d transformation is nearly affine. A similar argument applies 
to images of curved objects, for instance a bottle or a can. On a cylinder shaped 
object, there is one direction in which the surface is planar, which might allow 
a robust computation of the features, unless there is severe projective distortion 
in that direction. 

3.3 Finding Candidate Matches by Voting

Once the feature vectors have been computed for each profile, a voting approach 
is used to find candidate matches between interest points. All the feature vectors 
from the reference image are stored in a Kd-tree. Given a feature vector from 
the query image, the Kd-tree can be searched for all the feature vectors in the 
reference image having all their features within a specified interval. For every
pair of feature vectors matched within this interval, we compute the Mahalanobis
distance between them. If the distance is below a threshold, we cast a vote for the
matches of both the respective start- and endpoints of the two profiles. The idea behind
the voting process is that if two interest points are the images of the same
physical point, then they will have more matching intensity profiles in common
than two random points (see Fig. 3).





Fig. 3. Two matching interest points will receive many votes because they have several
matching intensity profiles in common.



There are several points to consider in order to accurately determine when 
two feature vectors are close enough. Ideally, the features should remain con- 
stant over affine transformations, but due to for instance illumination changes, 
camera movement, sampling and error in interest point location, they don’t. 
Unfortunately, it is extremely difficult to predict the variation of the features, 
as their values strongly depend on the degree of viewpoint change, the infor- 
mation content in the profile etc. It seems impossible to estimate parameters 
such as variances and covariances for each feature vector individually and point- 
less to estimate common parameters for all feature vectors, since the length and 
information content vary greatly among profiles. 

Apart from the variation of the features computed from an individual profile, 
one has to consider the distribution of the features of all the profiles in one image. 
If we form the covariance matrix of all the features in one image, we effectively 
get an estimate of the dynamic range of each feature. It is that range which
can be used for discriminating feature vectors. However, this matrix does not 
tell us anything about the distribution of single feature vectors over illumination 
changes, camera transformation etc. 

It seems reasonable though, that the magnitude of the spread of a feature 
for a single profile is related to the spread of the feature over an entire image. 
Therefore, we will assume that the variances of the features for individual profiles 
are proportional to the diagonal elements of the covariance matrix computed for 
all the feature vectors in the image. The smaller the constant of proportionality, 
the easier it is to distinguish between feature vectors. The only way we see to 
estimate the value of this constant is to try a number of different values on 
different images and see what works and what doesn’t work. A general way to 
set the constant is to choose it as large as possible, i.e. using a big window when 
matching feature vectors, without sacrificing the quality of the results. The off- 
diagonal elements in the covariance matrix will simply be put equal to zero, 
since they are usually at least an order of magnitude smaller than the diagonal
elements (due to the similarity of the Fourier and K-L coefficients).

The downside of using a big window is that some feature vectors give many 
matches. However, assuming a normal distribution of the feature vectors in one 
image, the ones around the mean will have many matches and are essentially 
useless for the voting process. Therefore, if a feature vector in the query image 
has more than a fixed number of matches in the reference image, it is simply 
discarded in our algorithm. Using a normal distribution assumption, it is possible 
to predict and reject the feature vectors that will generate too many matches 
before searching in the Kd-tree, thus saving computation time. 
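
The voting step of this section can be sketched as below. Several simplifications are assumptions made for illustration: a brute-force scan stands in for the Kd-tree window query, the covariance matrix is diagonal as described above, and the names and threshold values (`cast_votes`, `window`, `max_dist`, `max_matches`) are hypothetical.

```python
from collections import defaultdict

def cast_votes(query, reference, variances, window=3.0, max_dist=2.0, max_matches=10):
    """Vote for point matches supported by similar profile feature vectors.

    query, reference: lists of (start_point, end_point, feature_vector).
    variances: per-feature variances (diagonal covariance, as in the text).
    """
    sigmas = [v ** 0.5 for v in variances]
    votes = defaultdict(int)
    for q_start, q_end, q_vec in query:
        hits = []
        for r_start, r_end, r_vec in reference:
            # Window test on every feature (a Kd-tree would answer this query).
            if all(abs(a - b) <= window * s for a, b, s in zip(q_vec, r_vec, sigmas)):
                # Mahalanobis distance with a diagonal covariance matrix.
                d = sum((a - b) ** 2 / v for a, b, v in zip(q_vec, r_vec, variances)) ** 0.5
                if d < max_dist:
                    hits.append((r_start, r_end))
        if len(hits) > max_matches:
            continue  # too common to be discriminative, so discard this profile
        for r_start, r_end in hits:
            votes[(q_start, r_start)] += 1
            votes[(q_end, r_end)] += 1
    return votes

# Two query profiles sharing start point 0, three reference profiles.
query = [(0, 1, [0.5, 0.1]), (0, 2, [0.9, -0.3])]
reference = [(5, 6, [0.52, 0.12]), (5, 7, [0.88, -0.31]), (8, 9, [3.0, 3.0])]
votes = cast_votes(query, reference, variances=[0.01, 0.01])
print(dict(votes))
```

Query point 0 and reference point 5 receive two votes because two profiles starting at them agree, which is the effect the voting scheme relies on.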

3.4 Locating Candidate Matches in the Voting Table 

When the voting process is finished, we use a simple method to locate candidate 
matches in the voting table. For every row, the maximum value is found. If that 
value is also a maximum in the corresponding column, the two interest points 
with the given row and column indices are declared to be a candidate match. 

If any corner was matched more than once in the resulting list of matches, 
we keep the match with the highest number of votes and throw away the rest. 
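
The row and column test can be sketched as follows. The function name `mutual_maxima`, the handling of all-zero rows and the tie-breaking (first maximum in a row) are assumptions for illustration.

```python
def mutual_maxima(table):
    """table[r][c] holds the votes for query point r against reference point c.

    A pair is a candidate match if its entry is the maximum of its row and
    of its column. If a reference point is still matched more than once
    (possible with ties), only the match with the most votes is kept.
    """
    candidates = []
    for r, row in enumerate(table):
        best = max(row)
        if best == 0:
            continue  # no votes at all for this query point
        c = row.index(best)
        if best == max(table[k][c] for k in range(len(table))):
            candidates.append((r, c, best))
    best_per_column = {}
    for r, c, v in candidates:
        if c not in best_per_column or v > best_per_column[c][2]:
            best_per_column[c] = (r, c, v)
    return sorted(best_per_column.values())

voting_table = [
    [5, 1, 0],
    [1, 0, 3],
    [4, 2, 2],
]
print(mutual_maxima(voting_table))
```

In this small table, query point 2 has its row maximum in column 0, but column 0 is already dominated by query point 0, so point 2 yields no candidate match.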

3.5 Outlier Rejection with View Consistency Constraints 

In the list of candidate matches, there will generally be a number of mismatches, 
due to accidental correspondence of feature vectors. To eliminate these, one 
could compute the epipolar geometry with a robust estimation method such as 
RANSAC. This is computationally expensive though, since it involves repeatedly 
selecting sets of 7 matches, computing the SVD to find the fundamental matrix 
(1 or 3 solutions) and counting the number of consistent matches by checking the 
distance of points to their epipolar lines. If there are many outliers, this will take 
a long time, since a lot of the 7-point samples will contain outliers. We propose 
an intermediate outlier rejection stage to eliminate as many outliers as possible 
before using RANSAC. The approach is based on view consistency constraints.
Given a camera model and a sufficient number of points, one can eliminate
all camera parameters from the epipolar constraint and form a single equation 
constraining the image coordinates of a set of points. 

In our case, a scaled orthographic camera model has proved to be sufficient
to reject outliers. For this camera model, Carlsson derives an equation con-
straining the image coordinates of five points across two views A and B. If we
introduce the following expressions involving image coordinates $(x_k^A, y_k^A)$ of points
$k \in \{1, \ldots, 5\}$ in view A (and equivalently for view B),

$$a_{ij} = (x_i^A - x_1^A)(x_j^A - x_1^A) + (y_i^A - y_1^A)(y_j^A - y_1^A)$$

$$A_i = (a_{i2}, a_{i3}, a_{i4})^T \qquad (2)$$

and use the notation [..] for the determinant, the following constraint equation
must be satisfied if the points are actually the images of the same five points in
space (see Fig. 4):

$$\frac{[B_2 A_3 A_4] + [A_2 B_3 A_4] + [A_2 A_3 B_4]}{[B_2 A_3 A_5] + [A_2 B_3 A_5] + [A_2 A_3 B_5]} = \frac{[B_2 B_3 A_4] + [B_2 A_3 B_4] + [A_2 B_3 B_4]}{[B_2 B_3 A_5] + [B_2 A_3 B_5] + [A_2 B_3 B_5]} \qquad (3)$$
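
The following sketch evaluates a five-point view constraint of this kind numerically. It rests on assumptions that should be read as such: point 1 is taken as the origin of the inner products, each vector is taken as $A_i = (a_{i2}, a_{i3}, a_{i4})$, and the ratio constraint is cross-multiplied so that a consistent point set gives a residual of zero. All function names are hypothetical, and the normalization used for the values quoted in Fig. 4 is not known.

```python
import math

def det3(c1, c2, c3):
    """Determinant of the 3x3 matrix whose columns are c1, c2, c3."""
    return (c1[0] * (c2[1] * c3[2] - c3[1] * c2[2])
            - c2[0] * (c1[1] * c3[2] - c3[1] * c1[2])
            + c3[0] * (c1[1] * c2[2] - c2[1] * c1[2]))

def gram_vectors(pts):
    """pts: five (x, y) image points labelled 1..5. Returns {i: A_i} for i = 2..5."""
    x1, y1 = pts[0]
    rel = {k + 1: (x - x1, y - y1) for k, (x, y) in enumerate(pts)}

    def a(i, j):
        return rel[i][0] * rel[j][0] + rel[i][1] * rel[j][1]

    return {i: (a(i, 2), a(i, 3), a(i, 4)) for i in (2, 3, 4, 5)}

def view_constraint_residual(pts_a, pts_b):
    """Cross-multiplied ratio constraint; zero for view consistent points."""
    A, B = gram_vectors(pts_a), gram_vectors(pts_b)

    def one_b(k):  # determinants containing one column from view B
        return det3(B[2], A[3], A[k]) + det3(A[2], B[3], A[k]) + det3(A[2], A[3], B[k])

    def two_b(k):  # determinants containing two columns from view B
        return det3(B[2], B[3], A[k]) + det3(B[2], A[3], B[k]) + det3(A[2], B[3], B[k])

    return one_b(4) * two_b(5) - two_b(4) * one_b(5)

def project(points3d, yaw, pitch, scale):
    """Scaled orthographic view: rotate about the y axis, then the x axis, drop depth."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    image = []
    for X, Y, Z in points3d:
        xr = cy * X + sy * Z
        zr = -sy * X + cy * Z
        yr = cp * Y - sp * zr
        image.append((scale * xr, scale * yr))
    return image

# Five hypothetical 3-D points seen in two scaled orthographic views.
points3d = [(0, 0, 0), (1, 0, 0.5), (0, 1, -0.3), (1, 1, 1), (0.5, -0.7, 0.8)]
view_a = project(points3d, 0.0, 0.0, 1.0)
view_b = project(points3d, 0.6, -0.4, 1.7)
consistent = view_constraint_residual(view_a, view_b)

# Two arbitrary 2-D point sets that are not images of the same 3-D points.
mismatch_a = [(0, 0), (1, 0), (0, 1), (1, 1), (1, 2)]
mismatch_b = [(0, 0), (2, 0), (1, 1), (0, 1), (1, -1)]
inconsistent = view_constraint_residual(mismatch_a, mismatch_b)
print(consistent, inconsistent)
```

The first residual vanishes up to rounding error, while the arbitrary point sets give a residual far from zero. With noisy image coordinates a tolerance would be needed in place of an exact zero test.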

To find outliers, we repeatedly select random sets of five candidate matches
and evaluate the constraint equation above. If it is satisfied, the five matches
constitute a view consistent group of points, and we increment a counter for each
match in the group by one. The process terminates when the average number of
increments has reached a certain level. This is to make sure that the amount of data gathered
on average for each match is the same, regardless of the number of input matches 
or the fraction of outliers. To classify the matches as outliers or inliers, a k-means 
clustering on the number of times a match was in a view consistent group is used. 
The midpoint between the cluster centers is then used as a threshold to reject 
outliers. With this method, we were consistently able to reject over 50% of the 
outliers. 
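
The final classification step can be illustrated with a one-dimensional two-cluster k-means on the per-match counters. The initialization at the smallest and largest count, the function name `two_means_threshold` and the example counts are assumptions; the text only states that k-means is used and that the midpoint between the cluster centers serves as the threshold.

```python
def two_means_threshold(counts, iterations=100):
    """Cluster 1-D counts into two groups and return the midpoint of the centers."""
    lo, hi = float(min(counts)), float(max(counts))
    for _ in range(iterations):
        low = [c for c in counts if abs(c - lo) <= abs(c - hi)]
        high = [c for c in counts if abs(c - lo) > abs(c - hi)]
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high) if high else hi
        if (new_lo, new_hi) == (lo, hi):
            break  # assignments no longer change
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2.0

# Hypothetical counters: how often each match was part of a view consistent group.
counts = [0, 1, 1, 2, 9, 10, 11, 12]
threshold = two_means_threshold(counts)
inliers = [i for i, c in enumerate(counts) if c > threshold]
print(threshold, inliers)
```

Matches whose counter falls below the threshold are rejected before RANSAC is run.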

3.6 Computing the Epipolar Geometry 

To further eliminate outliers while simultaneously computing the epipolar geo- 
metry, RANSAC is used. We randomly select sets of 7 matches, compute the 
fundamental matrix and count the number of consistent matches. The process 
is repeated many times in order to find a good solution. 
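
To see why removing outliers beforehand pays off, one can use the standard estimate of the number of random samples RANSAC needs (a textbook formula, not one given in this paper): with inlier fraction w, a 7-point sample is outlier-free with probability w^7. The function name and the 99% confidence level are assumptions, and the inlier fractions below use the final RANSAC counts of Example 1 in Table 1 as a rough proxy for the true number of inliers.

```python
import math

def ransac_trials(inlier_fraction, sample_size=7, confidence=0.99):
    """Samples needed so that, with the given confidence, one sample is outlier-free."""
    p_good_sample = inlier_fraction ** sample_size
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good_sample))

# Example 1: 87 candidates after voting, 55 after the view constraints,
# 37 matches consistent with the final epipolar geometry.
trials_after_voting = ransac_trials(37 / 87)
trials_after_view_constraints = ransac_trials(37 / 55)
print(trials_after_voting, trials_after_view_constraints)
```

Under these assumptions the required number of samples drops from roughly 1800 to roughly 70, which is the motivation for the intermediate rejection stage.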

As a final step, the epipolar geometry is refined by a nonlinear minimization
using all the consistent matches. We simply used Zhengyou Zhang’s FMatrix
software to do this.

4 Results 

The algorithm was implemented in C++ and run on a Sun Ultrasparc 1, 167
MHz. We extracted about 400 interest points from each image. The total pro-
cessing time for each image pair was about one minute. The major bottleneck is
the Kd-tree; 50% of the time was often spent searching it.

Fig. 4. View constraints (single equations obtained by eliminating all camera para-
meters from the epipolar constraint) can be used to find outliers. If the selected point
set is view consistent, the value of the view constraint should be close to zero. Here,
the value of the view constraint evaluated for the points marked with circles is 0.25
and for the points marked with crosses 5.42.

In general, the search time for a window query in a Kd-tree is $O(d\,m^{1-1/d} + k)$,
where m is the number of vectors stored, d is the dimensionality of the data (in
our case 6) and k is the number of matches found [1]. The time complexity of
our algorithm is, given n interest points, $O(n^2)$ for the feature extraction part
and $O(n^2 (d\,(n^2)^{1-1/d} + k)) = O(n^2 (d\,n^{2-2/d} + k))$ for the voting part, where k is
the maximum number of allowed matches for each search. However, since we
don’t select all pairs of points, the complexity is in our implementation limited
to $O(d\,(nr)^{2-1/d} + nrk)$, where r (r < n) is a chosen number, independent of the
number of interest points.

Matching results are difficult to demonstrate in printed form. Some authors
show the images with the matches plotted and numbered in the images, while
others choose to draw epipolar lines. We have tried to use a somewhat
more illustrative method. Prior to taking the target image, we positioned the tip 
of a pen at the point in space where the focal point of the reference camera was. 
The epipolar point in the target image should then coincide with the image of 
the pen tip. Of course, this method can only be used if the epipolar point is in 
the image. The same method was used in another example, but instead of using 
a pen, two cameras were used, so that the reference camera is actually seen in 
the target image. 

As the results (Figs. 5 and 6) show, the epipolar geometry is not very accu-
rate for some of the examples, due to the fact that the object in question has
small depth variations, and occupies a small part of the image. Nevertheless,
the epipolar constraint can still be used to reject outliers among the matches.
The examples show that the algorithm can tolerate quite a large change of scale
between images.

Looking closely, one can see that one or two outliers might still be present
in the examples, especially the top example in Fig. 6, where the epipolar lines
are almost horizontal. The latter, combined with the repetitive structure, cau-
ses the mismatches. This example also reveals a situation which the algorithm
can’t handle: reflective surfaces. Due to reflection in the windows, the intensity
profiles differ strongly between images, hence points on the upper right part of
the building can’t be matched.

The results for all the examples are summarized in Table 1. By comparing the
last two columns of the table, it is clear that the view constraints are efficient
in the sense that they don’t remove inliers to any great extent.

In the very last example (Fig. 6), we fitted a homography with RANSAC,
instead of finding the fundamental matrix. From this example it is clear that the
algorithm is capable of handling projective distortions.



Table 1. Number of matches after different steps of the algorithm. The examples are
in the same order as in the figures.

            # matches after each step (VC = view constraint)

            Voting    VC    VC and RANSAC    no VC, RANSAC
Example 1     87      55         37               39
Example 2     92      50         37               36
Example 3     95      35         28               27
Example 4    140      37         31               34
Example 5    134      81         74               78
Example 6    115     100         78               76




5 Conclusions and Future Work 

We have presented a wide baseline matching algorithm which is based on a new 
way of extracting features. The fact that we randomly select pairs of interest 
points and compute features for the line between the points gives a fast algorithm 
which is hopefully more tolerant to projective distortions than an algorithm 
using regions, due to the fact that the algorithm can make use of lines going in 
a direction with less projective distortion. 

Furthermore, the fact that we look at image information between points 
which are quite far apart provides more robustness than using only local in- 
formation. This is of course also a limitation of the algorithm: it needs planar 
structures. 

Outliers will always be present in the matches based on image information
alone. Since it is computationally expensive to reject outliers by using RANSAC 
and computing the epipolar geometry, we introduced an intermediate outlier re- 
jection stage based on view consistency constraints. This allowed us to eliminate 
more than 50% of the outliers. 

The fact that intensity profiles must lie on a line means that this algorithm is
in its present form unsuitable for images of curved objects. On the other hand, for
some curved objects, the whole concept of wide baseline matching is inapplicable,
since the same part of the object’s surface can’t be seen from viewpoints which
differ too much.

As for future work and improving the method presented here, there are se- 
veral points to consider. First of all, the problem of how to measure the distance
between feature vectors requires more thought; our solution is quite ad hoc, but
we didn’t see any other way to do it.

Another area which deserves more attention is the fact that the algorithm 
requires a reasonable amount of information to be present on the surface of an 
object. On surfaces with little information (i.e. constant brightness regions), the 
matches are often ambiguous. It would therefore be nice to detect and handle that
case in some other fashion. 

The time complexity of the algorithm should be improved by using a faster 
data structure. The Kd-tree used in the current implementation is a bottle-neck. 
Since the dimensionality of the feature vectors is quite low, it might be possible 
to use indexing instead. 

Finally, the number of matches found by the algorithm is generally quite low 
for wide baselines. Searching for more matches along epipolar lines or combining
this method with Pritchett’s surface-following approach would potentially
give more matches.

Fig. 5. Results for matching image pairs. Number of matches: (after voting/view
constraints/RANSAC) top: (87/55/37) middle: (92/50/37) bottom: (95/35/28). The
tip of the pen or the camera lens serves as ground truth for the position of the epipolar
point. The three large circles in the left images indicate the points that generated the
epipolar lines in the right images.








Fig. 6. Results for matching image pairs. Number of matches: (after voting/view 
constraints/RANSAC) top: (140/37/31) middle: (134/81/74) bottom: (115/100/78). 
The three large circles in the left images indicate the points that generated the epipolar 
lines in the right images. In the bottom example a homography was estimated using 
RANSAC. The four corner points of the box in the lower left image were warped using
the estimated homography, resulting in the quadrangle shown in the lower right image.

Abstract. In order to build a statistical model of appearance we require a set of
images, each with a consistent set of landmarks. We address the problem of auto-
matically placing a set of landmarks to define the correspondences across an image
set. We can estimate correspondences between any pair of images by locating sali- 
ent points on one and finding their corresponding position in the second. However, 
we wish to determine a globally consistent set of correspondences across all the 
images. We present an iterative scheme in which these pair-wise correspondences 
are used to determine a global correspondence across the entire set. We show re- 
sults on several training sets, and demonstrate that an Appearance Model trained 
on the correspondences can be of higher quality than one built from hand marked 
images. 



Keywords: Image features, Statistical models of appearance, correspondence 



1 Introduction 

Statistical models of shape and appearance have proved powerful tools for interpreting 
images, particularly when combined with algorithms to match the models to new images
rapidly. In order to construct such models we require sets of labeled training images.
The labels consist of landmark points defining the correspondences between similar 
structures in each image across the set. 

The most time consuming and scientifically unsatisfactory part of building the models 
is the labeling of the training images. Manually placing hundreds of points on every image 
is both tedious and error prone. To reduce the burden, semi-automatic systems have been 
developed. In these a model is built from the current set of examples (possibly with extra 
artificial modes included in the early stages) and used to search the next image. The user 
can edit the result where necessary, then add the example to the training set. Though 
this can considerably reduce the time and effort required, labeling large sets of images 
is still difficult. 

We present a method which, given a bounding box of the object in each training 
image, automatically returns a set of correspondences across the entire training set. The 
found correspondences can then be used to build appearance models of the object. The
approach is to first find correspondences between pairs of images, then to use an iterative 
scheme to estimate correspondences across the whole training set. 

We find a set of salient points in each image, $I_i$, then find the best corresponding
points in each of the other images, $I_j$. The set of points in $I_i$ and correspondences in $I_j$
define a mapping from $I_i \to I_j$, $T_{ij}(x)$ (we use a thin-plate spline to obtain a continuous
mapping). However, because we may use a different set of points to find the mapping
$I_j \to I_i$, $T_{ji}(x)$, there is no guarantee that $T_{ij}(x) = T_{ji}(x)^{-1}$. In general, for three
images $T_{ij}(x).T_{jk}(x) \neq T_{ik}(x)$.

We seek to derive a set of new transformations between images, $G_{ij}$, which are
globally consistent, i.e. $G_{ij} = G_{ji}^{-1}$ and $G_{ij}.G_{jk} = G_{ik}$.

In practice we represent the transformations using the nodes of a grid. Placing this
grid in every image allows us to map from any image to another and back, hence defining
a globally consistent transform. The only errors are those caused by interpolation, or
those that arise if the grid ’folds’. We wish the new transformations, $G_{ij}$, to be as close
as possible to those derived from the correspondences, $T_{ij}$. We present an iterative scheme
in which the correspondences are used to drive the grid points towards a global solution.
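
A one-dimensional analogue shows why a shared grid yields globally consistent transforms. This is a sketch under assumptions: the paper works with a two-dimensional grid and does not state the interpolation scheme here, so piecewise-linear interpolation, the node positions and the names `grid_map` and `G` are all hypothetical.

```python
from bisect import bisect_right

def grid_map(x, nodes_src, nodes_dst):
    """Piecewise-linear map sending nodes_src[k] to nodes_dst[k].

    Both node lists must be strictly increasing, i.e. the grid must not fold.
    """
    k = min(max(bisect_right(nodes_src, x) - 1, 0), len(nodes_src) - 2)
    t = (x - nodes_src[k]) / (nodes_src[k + 1] - nodes_src[k])
    return nodes_dst[k] + t * (nodes_dst[k + 1] - nodes_dst[k])

# Positions of the same four grid nodes in three images i, j and k.
grids = {
    "i": [0.0, 10.0, 20.0, 30.0],
    "j": [0.0, 14.0, 19.0, 30.0],
    "k": [0.0, 5.0, 25.0, 30.0],
}

def G(a, b, x):
    """Transform a position x in image a to image b through the shared grid."""
    return grid_map(x, grids[a], grids[b])

x = 12.0
print(G("i", "j", x))                 # position in image j
print(G("j", "i", G("i", "j", x)))    # back in image i
print(G("j", "k", G("i", "j", x)), G("i", "k", x))  # via j, and directly
```

Because every transform is defined through the same nodes, mapping to another image and back returns the starting position, and mapping via an intermediate image agrees with the direct mapping, up to rounding error.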

In the following we briefly review how to construct appearance models. We describe 
how salient features can be used to define robust correspondences between image pairs. 
We explain how the correspondences together with Thin-plate splines can provide a con- 
tinuous but globally inconsistent transformation between image pairs. We then describe 
the iterative scheme which creates a globally consistent transform from the pair-wise
transforms. Finally we show some examples of automatically built appearance models 
and compare them with appearance models built from manually labeled training data. 



2 Background 

2.1 Globally Consistent Transforms 

Below we describe an algorithm, which is effectively an automatic method of multiple 
image registration. A comprehensive review of work in this field is given by Viergever
et al. Here we give a brief review of more recent and relevant work.

Many authors have attempted automatic or semi-automatic landmarking methods in
2D. Often authors assume that various contours have already been segmented from the
training set. This is in itself a time consuming problem.

Frey et al. present Transformed Component Analysis, which can learn linear
transformations such as translations and scale. Their method does not learn the non-linear
transformations necessary to provide a true dense correspondence between images.

Walker et al. attempted to automatically train Appearance Models on image
sequences by tracking salient features. Although models were built successfully, often
the tracking broke down.

Lester et al. describe a non-linear registration algorithm allowing different types
of viscous fluid flow model. It only allows mapping from one image to another, and
requires a fairly complex prior description of how different parts of the image are allowed
to move.

Christensen demonstrates pair-wise image registration using symmetric transfor-
mations which are guaranteed to be consistent (i.e. the mapping from image A to image B
is the inverse of that from B to A). He uses a Fourier basis for the allowed deformations,
and shows how a consistent mean image can be built by repeatedly mapping different 
images to the current mean. Our method differs in that we match each image with all 
others, not just to the current mean. 

Collins et al. register two images by building a non-linear deformation field,
recursively matching local spherical neighborhoods. Deformations are only constrained
to be locally smooth, imposed by explicit smoothing of the displacement field.

The closest work to ours is the ‘Multidimensional Morphable Models’ of Jones
and Poggio. These are linear models of appearance which can be matched to a new
image using a stochastic optimisation method. The model is built from a set of training
images using a boot-strapping algorithm. The current model is matched to a new image,
and optical flow algorithms are used to refine the fit. This gives the correspondences on the
new image, allowing it to be added to the model.

Most of the above methods concentrate on matching image pairs, whereas our technique
matches sets of images simultaneously.



2.2 Appearance Models 

An appearance model can represent both the shape and texture variability seen in a 
training set. The training set consists of labeled images, where key landmark points are 
marked on each example object. 

Given such a set we can generate a statistical model of shape variation by applying 
Principal Component Analysis (PCA) to the set of vectors describing the shapes in the 
training set. The labeled points, x, on a single object describe the shape of that object. 
Any example can then be approximated using: 

$$x = \bar{x} + P_s b_s \qquad (1)$$

where $\bar{x}$ is the mean shape vector, $P_s$ is a set of orthogonal modes of shape variation
and $b_s$ is a vector of shape parameters.

To build a statistical model of the gray-level appearance we warp each example image 
so that its control points match the mean shape (using a triangulation algorithm). Figure
1 shows three examples of labeled faces. We then sample the intensity information from
the shape-normalised image over the region covered by the mean shape. To minimise 
the effect of global lighting variation, we normalise the resulting samples. 

By applying PCA to the normalised data we obtain a linear model: 

$$g = \bar{g} + P_g b_g \qquad (2)$$

where $\bar{g}$ is the mean normalised grey-level vector, $P_g$ is a set of orthogonal modes
of intensity variation and $b_g$ is a set of grey-level parameters.

The shape and texture are correlated. By further analysis [5] we can derive a joint
model of shape and texture:

$$x = \bar{x} + Q_x c \qquad (3)$$

$$g = \bar{g} + Q_g c \qquad (4)$$

where $Q_x$ and $Q_g$ represent modes of shape and texture variation, and c is a vector
of appearance model parameters.

An example image can be synthesized for a given c by generating the shape-free
grey-level image from the vector g and warping it using the control points described by
x.

Fig. 1. Examples of faces labeled with consistent landmarks


3 Automatically Building Appearance Models 

To build an Appearance Model it is necessary to calculate a dense correspondence bet- 
ween all examples in the training set. This is typically achieved by placing consistent 
landmarks on all training examples and using a triangulation algorithm to approximate 
the dense correspondence. Below we describe a method of finding a consistent set of
landmarks automatically.

In the following sections we describe how we find correspondences between pairs of
images, and how thin plate splines can then be used to define a continuous transformation, 
Tij , between image pairs. We will then explain how an iterative scheme can be employed 
to calculate a new transformation which is globally consistent across the entire training 
set. This globally consistent transform provides the required dense correspondence. 

3.1 Locating Correspondences between Image Pairs 

The aim is to locate a set of points in each image, $I_i$, and find the best corresponding points
in all other training images, $I_j$. Correspondences can be located by selecting features in
image $I_i$ and locating them in image $I_j$. Walker et al. have shown that the probability
of calculating correct correspondences can be increased by selecting salient features in
image $I_i$.

For every pixel within the object boundary in $I_i$ we construct several feature vectors,
each of which describes a feature centered on the pixel at a particular scale. In the
following the feature vectors used were the first and second order normalised Gaussian
partial derivatives, giving a five dimensional feature vector for each scale considered.
Higher orders could be used depending on the computational power available.

The full set of vectors describing all facial features (one per image pixel) forms a
multi-variate distribution in a feature space. Walker et al. define salient features to be the
ones which lie in low density areas of this space, i.e. the ones which have a low probability
of being misclassified with any other feature in the object.

The result of this analysis is a set of salient features per training image. Note that the 
salient features in one training image are likely to be different from the salient features in 
another training image. Let $s_{ip}$ be the spatial position of the $p$th salient feature selected
from image $I_i$. Let $C_i$ be the covariance matrix calculated from all feature vectors
extracted from training image $I_i$.

Figure 2 shows examples of the salient features extracted from one training example
at different scales.




Fig. 2. Examples of the positions of fine scale (a), medium scale (b) and coarse scale (c) salient
features. 



In order to locate a correspondence between pairs of images we need to locate the
best match for each salient feature from image $I_i$ in image $I_j$. In order to simplify
this problem we make the following assumptions about the object we are attempting to 
model: 

- The object's features will not move more than a given number, $r$, of pixels between
training examples.

- The scale and orientation of the object and its features will not change significantly. 
These assumptions help constrain the search and reduce processing time. Only $\pi r^2$
candidate pixels in a particular training image need to be considered in order to locate the
best match for one particular salient feature. A candidate pixel has an associated feature
vector, $v$. The similarity between the $p$th salient feature from image $I_i$, and a candidate
vector, $v$, is given by:



$\delta(v_{ip}, v) = (v - v_{ip})^T C_i^{-1} (v - v_{ip})$ (5)

where $v_{ip}$ is the feature vector describing the $p$th salient feature in image $I_i$.
Figure 3 illustrates how a match for a salient feature from training image $I_i$ is located
in a second training image $I_j$.




Fig. 3. Calculating a correspondence between image $I_i$, (a), and $I_j$, (c). (b) is the similarity image
obtained whilst searching for the salient feature from image $I_i$ in image $I_j$.



By calculating the similarity, $\delta$, for all candidates in image $I_j$, we can form a similarity
image as shown in figure 3(b). We locate the best match by locating the lowest trough
in the similarity image. Let $m_{ijp}$ be the spatial position in image $I_j$ of the best match
to the $p$th salient feature from image $I_i$. Let $d_{ijp}$ be the similarity value, $\delta$, of the match $m_{ijp}$.
Note that $d_{ijp}$ is also linearly related to the log probability of the match $m_{ijp}$.
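The search described above can be sketched as follows, assuming the feature vectors of the candidate pixels are already available. The dictionary layout of `features`, the function names and the toy numbers are assumptions made for illustration. The similarity is the Mahalanobis-style distance of equation (5), and only candidates within $r$ pixels of the expected position are examined.

```python
def similarity(v, v_ref, cov_inv):
    """Equation (5): (v - v_ref)^T C^{-1} (v - v_ref); lower is more similar."""
    d = [a - b for a, b in zip(v, v_ref)]
    n = len(d)
    return sum(d[r] * cov_inv[r][k] * d[k] for r in range(n) for k in range(n))


def best_match(features, v_ref, cov_inv, centre, r):
    """Return (position, value) of the lowest similarity within radius r.

    `features` maps (row, col) -> feature vector (hypothetical layout).
    """
    best_pos, best_val = None, float("inf")
    for (row, col), v in features.items():
        if (row - centre[0]) ** 2 + (col - centre[1]) ** 2 > r * r:
            continue  # outside the allowed displacement
        val = similarity(v, v_ref, cov_inv)
        if val < best_val:
            best_pos, best_val = (row, col), val
    return best_pos, best_val


# Toy example with 2-D feature vectors and a diagonal inverse covariance.
cov_inv = [[1.0, 0.0], [0.0, 4.0]]
features = {
    (5, 5): [1.0, 1.0],
    (5, 6): [0.5, 0.0],
    (6, 5): [0.0, 0.5],
    (20, 20): [0.0, 0.0],  # identical feature, but farther than r pixels away
}
m_ijp, d_ijp = best_match(features, [0.0, 0.0], cov_inv, centre=(5, 5), r=3)
```

The distant candidate is never considered, which is how the displacement assumption reduces both the search cost and the risk of a spurious match.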

3.2 Calculating the Spatial Errors of the Matches 

For each salient feature, $s_{ip}$, in image $I_i$ we have located its position, $m_{ijp}$, in image $I_j$
and also a measure of probability, $d_{ijp}$, of the match being correct. The similarity image
as shown in figure 3(b) contains further information regarding the errors in the spatial
position of the match. In section 3.3 we show how this information can be used when
calculating the pair-wise image transforms.

For each correspondence we calculate a 2D covariance matrix which describes the 
spatial errors for the match. The covariance matrix is obtained by fitting a quadratic to 
the surface of the similarity image. 
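One way to realise the quadratic fit is sketched below, under two assumptions that are not stated in the text: the local quadratic is estimated from central differences on the 3x3 neighbourhood of the trough, and the similarity value is treated as minus twice the log probability (up to a constant), so that the covariance is twice the inverse Hessian. The function name is hypothetical.

```python
def spatial_covariance(sim, row, col):
    """Estimate a 2x2 covariance from the curvature of `sim` at its trough.

    `sim` is a 2-D list (the similarity image). The trough must not lie on
    the image border and the surface must be locally convex.
    """
    centre = sim[row][col]
    # Second derivatives by central differences (a local quadratic model).
    h_rr = sim[row + 1][col] - 2.0 * centre + sim[row - 1][col]
    h_cc = sim[row][col + 1] - 2.0 * centre + sim[row][col - 1]
    h_rc = (sim[row + 1][col + 1] - sim[row + 1][col - 1]
            - sim[row - 1][col + 1] + sim[row - 1][col - 1]) / 4.0
    det = h_rr * h_cc - h_rc * h_rc
    if det <= 0.0:
        raise ValueError("similarity surface is not convex at this point")
    # Assumed relation: covariance = 2 * inverse Hessian.
    scale = 2.0 / det
    return [[scale * h_cc, -scale * h_rc], [-scale * h_rc, scale * h_rr]]


# Toy similarity image: steep in the row direction, shallow in the column
# direction, as for a feature lying on a horizontal edge.
sim = [[4.0 * (r - 2) ** 2 + 0.25 * (c - 2) ** 2 for c in range(5)]
       for r in range(5)]
cov = spatial_covariance(sim, 2, 2)
```

In this toy case the variance is small across the edge and large along it, which is the elongated-ellipse behaviour the text describes for edge and ridge features.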
Figure 4 shows two training images. The salient features found in the left hand
training image are shown and the positions of their matches in the second image are
indicated on the right. The ellipses centered on each match represent the spatial errors
of that match. Note that features which lie on edges or ridges have a small error perpendicular
to the edge but a large error parallel to the edge. This indicates that the match
should only be used to constrain a correspondence in a direction perpendicular to the
edge.




Fig. 4. The spatial errors associated with salient feature matches, (a) is a training image with its 
salient features marked, (b) shows a second training image with the position of the best matches
to the salient features in (a) shown. The ellipses centered on each match in (b) represent the spatial 
errors of that match. 



3.3 Transformation between Images 

Let $x_j = f_{ij}(x_i | y_i, y_j)$ be a mapping from points $x_i$ in image $I_i$ to points $x_j$ in image
$I_j$, controlled by a set of control points $y_i$ and $y_j$. For instance, we can use a thin-plate
spline with control points $y_i$ and $y_j$ to define this mapping.

The correspondences and associated spatial errors can be used to define the continuous
transformation, $T_{ij}$, between image $I_i$ and $I_j$. Rohr et al. showed how
anisotropic control point errors, such as those described in section 3.2, could be integrated
into a thin-plate spline. This means that matches will only constrain the spline
strongly where the match was accurately located. Thus

$T_{ij}(x) = f_{ij}(x | s_i, m_{ij})$ (6)

where $s_i$ are the salient features in image $i$, and $m_{ij}$ the corresponding matches in
image $j$. Figure 5 shows how the transformation can be used to map a grid in image
$I_i$ onto image $I_j$.




Fig. 5. Transformation Tij has been applied to the grid in the left hand image in order to locate its 
position in the right hand image. The control points (salient features and the matches) are marked in
each image. 



3.4 Calculating the Global Correspondence 

So far we have created a transformation, $T_{ij}$, between image $I_i$ and image $I_j$. This
transformation can be computed for all image pairs. However, there is no guarantee
that $T_{ij} = T_{ji}^{-1}$ since different correspondence pairs may be used. In general, for three
images $T_{ij} \cdot T_{jk} \neq T_{ik}$. We seek to derive a set of new transformations between images,
$G_{ij}$, which are globally consistent, i.e. $G_{ij}(x) = G_{ji}(x)^{-1}$ and $G_{ij}(x) \cdot G_{jk}(x) = G_{ik}(x)$.

We represent $G_{ij}$ using the nodes of a grid. Placing this grid on every image allows
us to map from any image to another and back. 

In order to calculate a globally consistent correspondence across the entire training
set we employ an iterative scheme which uses $T_{ij}$ to refine $G_{ij}$ by adjusting the position
of the grid nodes.

The scheme is first initialised with an approximation to the final global correspondence.
A rectangular grid is placed over each training image. The grid has the same
connectivity in all images but is scaled to the approximate size of the target in each
image. $x_i$ is the position of the grid nodes in image $I_i$. We call this the initialisation
grid. Figure 6(a) shows an example of this approximate correspondence for a small
training set. Note that if we choose to use $x_i$ as landmarks at this stage, the resulting
appearance model would be equivalent to an eigen model, since no shape deformation
is included at first.
Fig. 6. Example of how the grid deforms to represent the underlying shape, (a) shows how the
grid is initialised, (b) after one iteration and (c) is after convergence.



At each stage we have the grid points $x_i$ on each image $I_i$. One iteration consists of
updating each $x_i$ in turn as follows. We use the pairwise transformations $T_{ji}$ to project
every $x_j$ onto the $i$th image, then compute a weighted average. More explicitly, we
update $x_i$ using:



$x_i' = \left( \sum_{j=1}^{n_t} W_{ji} \right)^{-1} \sum_{j=1}^{n_t} W_{ji} T_{ji}(x_j)$ (7)

where $n_t$ is the number of training images and $W_{ji}$ is a diagonal matrix of weights,
the $k$th element of which describes the confidence in the prediction of the position of
the $k$th node (estimated from the confidence of the matches, $d_{jip}$, used to define $T_{ji}$).
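The weighted average of equation (7) can be sketched per grid node as below, assuming the projected grids $T_{ji}(x_j)$ have already been computed. The list layout, the function name `update_grid` and the toy weights are assumptions for illustration. Because each weight matrix is diagonal, the matrix inverse reduces to a division by the summed weight of each node.

```python
def update_grid(projections, weights):
    """Equation (7) for one image i.

    projections[j][k] is node k of grid x_j projected into image i, given as
    an (x, y) pair; weights[j][k] is the confidence of that prediction (the
    k-th diagonal element of W_ji). Every node needs a non-zero total weight.
    """
    n_nodes = len(projections[0])
    new_grid = []
    for k in range(n_nodes):
        total = sum(w[k] for w in weights)
        x = sum(w[k] * p[k][0] for p, w in zip(projections, weights)) / total
        y = sum(w[k] * p[k][1] for p, w in zip(projections, weights)) / total
        new_grid.append((x, y))
    return new_grid


# Two grid nodes predicted from three images (toy values).
projections = [
    [(10.0, 10.0), (20.0, 10.0)],
    [(12.0, 10.0), (20.0, 14.0)],
    [(14.0, 10.0), (20.0, 10.0)],
]
weights = [
    [1.0, 1.0],
    [2.0, 0.0],  # zero confidence: this prediction of node 1 is ignored
    [1.0, 1.0],
]
x_i = update_grid(projections, weights)
```

A poorly matched projection therefore has little influence on the node, while confident projections dominate the new position.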

After each iteration it is important to normalise the grid, $x_i$. This is because it is
possible for the grid to move off the object slightly. In order to normalise $x_i$ we repropagate
the initialisation grid from the first training example, $r$, onto all other examples, using
$x_i$ as the control points for the mapping. $x_i$ then becomes this newly propagated grid.
Mathematically, $x_i$ is normalised as follows:

$x_i \leftarrow f_{1i}(r | x_1, x_i)$ (8)

where $r$ is the initialisation grid for training example 1.

The process converges after a few iterations. Figure 6 shows how the grid $x_i$ deforms
after each iteration. After one iteration the grids are no longer rectangular, and therefore
an Appearance Model built from the grid nodes would capture some shape variation.
This will make the model much more compact than an equivalent eigen model.

3.5 Multi-resolution Framework 

Section 3.4 describes an iterative scheme in which an initial approximation to the true
globally consistent transform is improved using pair-wise transformations. This approach
lends itself well to a multi-resolution framework which improves the system's
robustness to extreme deformations in the training examples. This involves building a
Gaussian pyramid for each training example (where level 0 is the original image
and level $L_{max}$ is the coarsest level), then performing the above analysis on the coarse
scale training images to get an approximate globally consistent transform. At the coarse
resolution we can search more of the image efficiently to obtain the correspondences
between image pairs, allowing for larger deformations between images. We then use
the finer resolutions to further refine the globally consistent transform. The following
pseudo code further explains the multi-resolution framework.

1. Set $x_i$ to the initialisation grid

2. Set $L = L_{max}$

3. While $L > 0$

   a) For each image $i$

      i. Locate salient features, $s_i$, at level $L$ from training image $i$

      ii. For each image $j \neq i$

         A. Predict the approximate positions, $m'_{ij} = f(s_i | x_i, x_j)$, of each of the
            salient features, $s_i$, for image $i$ in image $j$

         B. Search near the predicted matches, $m'_{ij}$, for better matches, $m_{ij}$, using
            equation 5

   b) Define the set of pair-wise transformations, $\{T_{ij}\}$, using the salient features, $s_i$,
      together with their matches, $m_{ij}$, as shown by equation 6

   c) Define a new globally consistent transformation, $x_i$, using the iterative scheme
      defined in section 3.4

   d) If $L > 0$ then $L \leftarrow (L - 1)$

4. The final result is the globally consistent transform as described by $x_i$

Figure 7 illustrates how the grid, $x_i$, deforms as the resolutions are descended.



4 Results 

We attempted to use the multi-resolution scheme to automatically landmark 4 training 
sets. Each training set had pose and expression variation, but no identity variation. We 
used the automatically generated landmarks to build an appearance model for each of the
training sets. In order to provide a means of contrast, we also built two other models from
each training set. The first of these was trained using hand placed landmarks and the
second was an eigen model. Figure 8(a) illustrates the principal modes of variation
of the manually trained model, (b) shows the modes of the automatically trained model, 
and (c) shows the modes of the eigen model. Note that the modes of the manually trained 
and automatically trained model are of a similar quality when compared with those of 
the eigen model. 

It is difficult to assess the results quantitatively without considering the application 
to which the model will be put. For effective coding (compression), we require that 
truncating the number of model modes has a minimal effect on the reconstruction error. 
We can thus compare models by how the reconstruction error varies with the number of 
modes. 

Figure 9 shows how the reconstruction texture error, (a), and the shape error, (b),
increase as the model modes decrease for one particular training set. The graphs show that
both the texture error and shape error are reduced by training the model automatically.
The automatic method thus leads to a more compact model for this training set. As
expected, the texture error of both the manually and automatically trained models is
significantly better than that of the eigen model.

Out of the 4 training sets tested all of the automatically generated models proved to 
be more compact than their equivalent manually trained models.



5 Discussion 



We have demonstrated one possible approach to automatically training appearance mo- 
dels. The system calculates globally inconsistent transformations between all image 
pairs. These transformations are used to drive an iterative scheme which calculates a 
globally consistent set of transforms. The globally consistent set of transforms is used 
to provide the correspondence necessary to build an appearance model. The robustness 
of the scheme also benefits from a multi-resolution framework. 

This work should be considered as a general method in which to obtain globally 
consistent transforms from pair-wise transforms. The pair-wise transforms could be 
generated from other methods such as optical flow. 

Techniques that attempt to automatically correspond sets of images by starting with
an inadequate model and then sequentially trying to add unseen instances to improve the
model have one serious problem. If, at the early stages of training, they are
presented with an unseen image that is radically different to anything previously seen, a
good fit will not be found. Adding this fit to the model only serves to corrupt the model,
increasing the chances of further failures. Moreover, the success of these techniques is
dependent on the order in which the training images are presented to the system. The sets
of images used in these techniques often sidestep this problem. Walker et al. [16] used
image sequences ensuring a gradual change. The technique presented in this paper does
not deal with the training images sequentially and therefore the success is not dependent
on the ordering.
An important conclusion drawn from this work is that models created from hand
landmarked training data are not a gold standard: models created from automatically
labeled data can be of a higher quality due to the elimination of human error.

Fig. 7. Example of how the grid deforms to represent the underlying shape during multi-resolution
analysis, (a) shows how the grid is initialised, (b) is after the coarsest resolution $L = L_{max}$, (c)
is after $L = L_{max} - 1$, (d) is after convergence.

Fig. 8. The 3 most significant modes of variation for a manually trained model (a), an eigen model (b)
and an automatically trained model (c). The modes in all cases are shown to ± 2.5 standard deviations.

Fig. 9. Compares the texture error (left) and shape error (right) of an automatically built model
with that of a model trained from hand placed landmarks.

Abstract. A new algorithm for approximating intensity images with adaptive
triangular meshes keeping image discontinuities and avoiding optimization is 
presented. The algorithm consists of two main stages. In the first stage, the 
original image is adaptively sampled at a set of points, taking into account 
both image discontinuities and curvatures. In the second stage, the sampled 
points are triangulated by applying a constrained 2D Delaunay algorithm. The 
obtained triangular meshes are compact representations that model the regions 
and discontinuities present in the original image with many fewer points. 

Thus, image processing operations applied upon those meshes can perform 
faster than upon the original images. As an example, four simple operations 
(translation, rotation, scaling and deformation) have been implemented in the 
3D geometric domain and compared to their image domain counterparts.* 

1 Introduction 

The standard formats commonly used for image compression, such as GIF and 
JPEG, were not originally devised for applying further processing. Thus, images 
codified in those formats must be uncompressed prior to being able to apply image 
processing operations upon them, no matter how big and redundant the images are. 
Nonetheless, some researchers have managed to apply various basic operations upon 
compressed representations. For example, [1] presents a technique for applying 
arithmetic operations directly to JPEG images. Several techniques for image manip- 
ulation and feature extraction in the DCT domain are also presented in [2]. 

An alternative to the problem of compactly representing images consists of the 
utilization of geometric representations, such as triangular meshes. Those meshes
allow the modeling of large areas of pixels with basic geometric primitives. For 
example, a large white region can be represented by a few triangles instead of by 
hundreds of pixels. Geometric representations are applicable since the pixels of an 
image can be considered to be 3D points in a space in which coordinates x and y are 
functions of the rows and columns of the image, and coordinate z is a function of the 
gray level. An additional advantage of using geometric representations is that they 
allow the application of techniques that have been successfully utilized in other 
fields, such as computer graphics or computer vision (e.g., [3][6][7][15]). 
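A minimal sketch of this pixel-to-point view is given below. Taking x as the column index, y as the row index and z as the unscaled gray level is an assumption, since the text only says the coordinates are functions of those quantities. The function name is hypothetical.

```python
def image_to_points(image):
    """Map each pixel to a 3D point (x, y, z) = (column, row, gray level)."""
    return [(col, row, float(gray))
            for row, line in enumerate(image)
            for col, gray in enumerate(line)]


# Toy 2x2 image given as a list of rows.
image = [[0, 255], [128, 64]]
points = image_to_points(image)
```

Every pixel yields one point here; the purpose of the adaptive mesh is to keep only a small subset of such points.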

Some algorithms have been proposed for generating geometric approximations 
from images (e.g., [8][9]). These algorithms generate an initial high-resolution mesh 
by mapping each pixel of the original image to a point in a 3D space and by linking 
those points according to some topological criteria. Then, an iterative, optimization- 
based algorithm decimates the obtained mesh until either a certain maximum error 
between the current mesh and the original image is reached or a certain number of 
points is attained. A recent algorithm [5] decimates an initial mesh based on a non- 
optimization-based iterative algorithm that ensures a maximum approximation error. 
The inconvenience of that method is that, since image discontinuities are not explic- 
itly modeled, an oversampling of the given images is generated. 

Besides the use of geometric representations as tools for image modeling, little 
research has been done regarding their utilization for simplifying and accelerating 
general image processing operations. For instance, [4] presents a technique for seg- 
menting range images approximated with adaptive triangular meshes. Furthermore, 
[10] and [16] present an algorithm for segmenting intensity images. This algorithm 
is based on an initial triangulation of corners detected in the image, followed by an 
iterative, optimization-based split-and-merge technique applied to the triangles of 
the mesh. 

This paper presents an efficient algorithm for approximating intensity images 
with adaptive triangular meshes without applying iterative optimization. The 
obtained meshes preserve the shades and discontinuities present in the original 
image. Furthermore, an efficient technique for converting triangular meshes to inten- 
sity images is also described. Finally, it is shown how the previously obtained 
triangular meshes can be utilized to accelerate image processing operations through
four basic operations: translation, rotation, scaling and deformation.

This paper is organized as follows. The proposed approximation algorithm is 
described in section 2. Section 3 presents a technique for generating intensity images 
from triangular meshes. Section 4 describes the implementation of four simple 
image processing operations applied upon the previously obtained triangular 
meshes. Experimental results are shown in section 5. Conclusions and further 
improvements are presented in section 6. 

2 Approximation of Intensity Images with Adaptive Triangular 
Meshes 



This section presents a technique for approximating intensity images with
discontinuity-preserving adaptive triangular meshes. The triangular meshes are composed of
points sampled over the original image. Each point corresponds to a certain pixel
and is defined by three coordinates: row, column and gray level. The proposed technique
consists of two stages. The first stage adaptively samples the given image at a
set of pixels. First, the image edges are detected, adaptively sampled and approxi- 
mated by polylines. Then, the internal regions comprised between image edges are 
adaptively sampled. The second stage triangulates the points sampled over internal 
regions, using the previously obtained polylines as constraints for the triangulation. 
Both stages are described below. 

2.1 Image Adaptive Sampling 

The aim of this stage is to sample the given image, obtaining a set of points that are 
distributed according to the shades present in the image. The triangulation of those 
points will be used as a higher abstraction-level representation of the original image. 
Edges (contours) and internal regions in the image are approximated separately. 

2.1.1 Edge Adaptive Sampling 

Intensity images usually contain sudden changes in gray level due to region bound- 
aries. The edge sampling stage approximates those boundaries with adaptive 
polylines. First, the edges present in the original image are found by applying 
Canny’s edge detector [11] and then by thresholding the result so that all pixels with 
a value above zero are set to gray level 0 (black) while the other pixels are set to gray 
level 255 (white). Thus, an edge image is generated, as shown in Fig. 1 (right).




Fig. 1. (left) Original image of 512x512 pixels, (right) Edge image generated from the previous 
image. 



Then, each edge in the edge image is adaptively approximated by a collection of 
segments that constitute a polyline. The points that define the segments of a polyline 
are obtained through the following iterative procedure. 

Fig. 2. Edge sampling process. (left) Edge extraction, (right) Edge unfolding.



First, the edge image is scanned from left to right and top to bottom until a pixel 
contained in an edge is found. This pixel is chosen as the starting point, A. The cho- 
sen edge is traversed from the starting point, and a second pixel contained in the 
same edge and placed at a user defined number of pixels away from the starting 
point is selected. This second pixel is the reference point, C. Both the starting point
and the reference point generate an approximating segment, AC. The pixels that
belong to the chosen edge comprised between the starting point and the reference
point constitute the approximated points. The distance in image coordinates between
each approximated point and the approximating segment is the approximation error.
If all the current approximation errors are below a given threshold d, a new reference
point is selected by advancing the previous reference point C a fixed number
of pixels along the chosen edge. Then, a new segment joining the starting point and
this new reference point is defined and the previous procedure is iterated.

When an approximated point is found to have an error above d, that point is cho- 
sen as the new starting point, B. The edge is traversed in this way until either one of 
its extremes is reached or a bifurcation is found. The polyline that approximates the 
previous edge with an error bounded by d is the set of segments that join all the start- 
ing points found during the exploration of the edge, plus the final point in the edge 
(extreme or bifurcation). The points that define the polyline constitute the control 
points. Fig. 2 (left) shows an example of the previous procedure. 
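The procedure above can be sketched for a single edge as follows. The sketch assumes the edge has already been traced into an ordered list of pixels, so the scan-line search, bifurcation handling and pixel removal are not shown. The function names and the `step` parameter (the user-defined advance of the reference point) are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment ab."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def approximate_edge(edge, d, step):
    """Return the control points of a polyline approximating `edge`.

    `edge` is an ordered list of (x, y) pixels, `d` the error threshold and
    `step` (>= 1) the number of pixels the reference point advances.
    """
    last = len(edge) - 1
    control = [edge[0]]
    start, ref = 0, min(step, last)
    while True:
        # First approximated point whose error exceeds the threshold, if any.
        bad = next((k for k in range(start + 1, ref)
                    if point_segment_distance(edge[k], edge[start], edge[ref]) > d),
                   None)
        if bad is not None:
            start = bad                      # new starting point
            control.append(edge[start])
            ref = min(start + step, last)
        elif ref == last:
            break                            # end of the edge reached
        else:
            ref = min(ref + step, last)      # advance the reference point
    if control[-1] != edge[-1]:
        control.append(edge[-1])             # final point of the edge
    return control


# An L-shaped edge: along the x axis, then up the line x = 5.
l_edge = [(x, 0) for x in range(6)] + [(5, y) for y in range(1, 6)]
l_control = approximate_edge(l_edge, d=0.5, step=2)
straight = approximate_edge([(x, 0) for x in range(10)], d=0.5, step=2)
```

A straight edge collapses to its two end points, while the corner of the L-shaped edge attracts several control points because each offending pixel becomes a new starting point.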

When an edge has been successfully approximated by a polyline, all the points 
traversed during the process are removed from the edge image so that they do not 
intervene in further edge approximations. This polyline extraction procedure is 
applied until all edges have been approximated and, therefore, the edge image is 
white. Since the starting points of the polylines are found by applying the aforementioned
scan-line algorithm, different executions of the edge sampling process upon the
same image will produce the same polylines.

Each polyline obtained above delimits a boundary between two neighboring 
regions, indicating a discontinuity in the gray level values. The points that form the 
polyline (control points) correspond to pixels that can be located at both sides of the 
discontinuity. However, the final 3D triangular mesh requires that these discontinui- 
ties be modeled as vertical “walls”. These walls can only be produced by unfolding 
each obtained polyline into two parallel polylines, each at a different gray-level
(height). This process is done by computing an opposite point for each polyline’s 
control point. Each opposite point will be located at the other side of the discontinu- 
ity in which its corresponding control point lies. 

Given a control point P, its corresponding opposite point is obtained as follows. 
First, an exploration line that bisects the polyline’s segments that meet at P and that
passes through P is computed (Fig. 2, right). The pixels traversed by this line are
explored in both directions starting with P. The first pixel along that line where a significant
change of gray level occurs is chosen as P’s opposite point. The opposite
points corresponding to the extremes of the polyline are determined by considering 
as exploration lines the lines perpendicular to the segments that abut at those 
extremes, and then by applying the previous criterion. Fig. 2 (right) shows an exam- 
ple of this procedure. 

A new polyline is obtained for each original polyline by linking its corresponding 
opposite points. Since the control points that define the original polyline may not be 
located at the same side of the discontinuity, the new polyline and the original one 
may not be parallel and, therefore, may have some segments that self-intersect. To 
avoid this problem, the two polylines are traversed exchanging corresponding pairs 
of control and opposite points, such that each polyline only contains the points that 
are located at the same side of the discontinuity (all the points of a polyline must 
have a similar gray level). Thus, two parallel polylines are finally generated from 
each original polyline, one completely lying on one side of the discontinuity (at the 
region with the highest gray level) and the other completely lying on the other side 
(at the region with the lowest gray level), Fig. 2 (right). Fig. 3 shows the set of con- 
trol points and opposite points corresponding to the example utilized so far. 




Fig. 3. (left) Set of both control and opposite points obtained by the edge adaptive sampling 
process (4,900 points), (right) Parallel polylines obtained from the previous points. 



2.1.2 Region Adaptive Sampling 



This stage aims at obtaining a set of points adaptively distributed over the image,
such that they concentrate in high-curvature regions and scatter over low variation
regions. The sampling process must be applied by taking into account that all the
edges in the image have already been considered by the previous step and do not
have to be resampled.

Fig. 4. (left) Curvature image, (center) Adaptively sampled vertical curves (7x7 tiles), (right)
Set of points obtained by region adaptive sampling (2,324 points).

This process is done by applying a non-optimization, adaptive sampling tech- 
nique previously developed [3]. That technique, which was originally proposed for 
range images, efficiently samples a predefined number of pixels over the given 
image adapting to the curvatures present in it. In order to apply that sampling tech- 
nique to intensity images, it suffices to consider each pixel as a 3D point with three 
coordinates, which correspond to the pixel’s row, column and gray level. The adaptive
sampling process is summarized below. A complete description can be found in [3].

First, a curvature image [3] is estimated from the original image (Fig. 4, left).
Since the image contours have already been approximated with parallel polylines 
(section 2.1.1), the curvatures of the pixels that belong to the contours of the original 
image, detected through Canny’s edge detector, and their adjacent neighbors are 
reset so that they do not cause further resampling. 

Both the original and curvature images are then divided into a predefined number
of rectangular tiles. The following steps are independently applied to each tile.
First, a predefined number of points is chosen for each row of every tile, in such a 
way that the point density is proportional to the curvatures previously computed for 
the pixels of that row. After that horizontal sampling, a set of vertical curves is 
obtained, Fig. 4 (center). 

Then, each vertical curve is adaptively sampled at a predefined number of points 
whose density is again proportional to the curvature estimated for the pixels con- 
tained in the curve. In the end, a predefined number of adaptively sampled points is 
obtained for every tile, Fig. 4 (right). The number of both tiles and points per tile is
defined by the user. Many tiles lead to uniformly-sampled meshes, while a few tiles 
lead to degenerated meshes. An intermediate value must be experimentally set. 
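The horizontal step, choosing a fixed number of points along one row with a density proportional to curvature, might look like the sketch below. It splits the cumulative curvature into equal parts and picks the pixel at the middle of each part. This is one possible reading of the idea and not the exact procedure of [3]; the function name and the toy curvature row are assumptions.

```python
def adaptive_sample(curvature, n_points):
    """Pick `n_points` indices along a row so that density follows curvature.

    Indices can repeat when a single pixel carries most of the curvature.
    """
    weights = [abs(c) for c in curvature]
    total = sum(weights)
    if total == 0:
        # Flat row: fall back to uniform weights.
        weights = [1.0] * len(curvature)
        total = float(len(curvature))
    samples, cumulative, idx = [], 0.0, 0
    for k in range(n_points):
        target = (k + 0.5) * total / n_points
        # Advance until the pixel whose cumulative weight reaches the target.
        while idx < len(weights) - 1 and cumulative + weights[idx] < target:
            cumulative += weights[idx]
            idx += 1
        samples.append(idx)
    return samples


# Toy row: low curvature at both ends, high curvature in the middle.
row_curvature = [1.0] * 4 + [3.0] * 8 + [1.0] * 4
samples = adaptive_sample(row_curvature, 8)
```

In this toy row the middle pixels hold three quarters of the total curvature and receive six of the eight samples.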

The merging of the two previous sets of points, obtained after both edge sampling
and region sampling, produces the final result of the adaptive sampling stage.
Fig. 5 (left) shows the set of sampled points that approximate the given original
image.

Fig. 5. (left) Final set of points obtained after the adaptive sampling process (7,224 points).
(center) Adaptive triangular mesh generated from the previous points, (right) Approximating
image obtained from the previous triangular mesh through z-buffering in 0.31 sec.

2.2 Triangular Mesh Generation 

The final aim of this stage is the generation of an adaptive triangular mesh from the 
set of points obtained after the adaptive sampling stage. This triangular mesh is an 
approximation of the given intensity image obtained by means of a constrained 2D 
Delaunay triangulation algorithm [12]. The triangulation is applied to the x and y 
coordinates of the sampled points. The constraints of the triangulation are the paral- 
lel polylines that approximate the detected contours of the image. These polylines 
can be either open or closed. In this way, it is guaranteed that the discontinuities of 
the gray level values are preserved as edges in the generated triangular mesh. Fig. 5 
(center) shows the final adaptive triangular mesh obtained for the current example. 
The z coordinates of the vertices of that mesh correspond to the gray levels associ- 
ated with them in the original image. 

3 Generation of Intensity Images from Triangular Meshes 

Any compression or coding algorithm requires a corresponding decompression or 
decoding counterpart that allows the recovery of data in the original format. Like- 
wise, a tool for generating intensity images from adaptive triangular meshes is 
necessary. 

Triangular meshes are utilized to represent intensity images by assuming that the 
first two dimensions of the points in the mesh correspond to row and column image 
coordinates, and the third dimension to a gray level. An intensity image can be gen- 
erated from a triangular mesh by considering that each triangle of the mesh 
represents a plane that contains a set of points (pixels). The z coordinates of the 
points inside that plane represent gray level values in the resultant image. Hence, the
image generation process can be based on computing the bounding box of every triangle
T of the mesh, then finding the pixels of this box which are contained within T,
and finally obtaining the z coordinates of these points by using the plane equation 
corresponding to T. 
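
The bounding-box procedure described above can be sketched as follows. This is an illustrative NumPy version, not the authors' code: the function name `rasterize_mesh`, the array layout (one `(row, col, gray)` triple per vertex) and the use of barycentric weights to evaluate the plane equation are assumptions made for the example.

```python
import numpy as np

def rasterize_mesh(vertices, triangles, height, width):
    """Render a gray-level image from a triangular mesh (hypothetical helper).

    vertices:  (N, 3) array of (row, col, gray) points.
    triangles: (M, 3) array of vertex indices.
    """
    image = np.zeros((height, width), dtype=float)
    for tri in triangles:
        a, b, c = vertices[tri]
        # Bounding box of the triangle, clipped to the image.
        r0 = max(int(np.floor(min(a[0], b[0], c[0]))), 0)
        r1 = min(int(np.ceil(max(a[0], b[0], c[0]))), height - 1)
        c0 = max(int(np.floor(min(a[1], b[1], c[1]))), 0)
        c1 = min(int(np.ceil(max(a[1], b[1], c[1]))), width - 1)
        # Twice the signed area; zero means a degenerate triangle.
        det = (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])
        if det == 0:
            continue
        rows, cols = np.mgrid[r0:r1 + 1, c0:c1 + 1]
        # Barycentric coordinates of every pixel in the box.
        wb = ((rows - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (cols - a[1])) / det
        wc = ((b[0] - a[0]) * (cols - a[1]) - (rows - a[0]) * (b[1] - a[1])) / det
        wa = 1.0 - wb - wc
        inside = (wa >= -1e-9) & (wb >= -1e-9) & (wc >= -1e-9)
        # Barycentric interpolation of z evaluates the plane through a, b, c.
        z = wa * a[2] + wb * b[2] + wc * c[2]
        image[rows[inside], cols[inside]] = z[inside]
    return image
```

Pixels shared by two adjacent triangles are simply written twice; since both planes agree along the common edge, the result does not depend on the order.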

However, by taking advantage of the 3D nature of these triangular meshes, the 
computational cost of the previous algorithm can be reduced almost by half by 
applying a well-known algorithm, z-buffering, which is extensively utilized in computer
graphics. Thus, the image generation stage has been finally implemented as
follows. First, the triangular mesh is visualized in a window with the same size as
the desired image through functions of the standard 3D OpenGL library. Next, the
z-buffer obtained after this visualization is read with another OpenGL function
(glReadPixels). Finally, the intensity image is obtained by linearly mapping the
values of the z-buffer to gray levels in the range [0-255].
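
The last step, the linear mapping of depth values to gray levels, might look like the sketch below. It assumes the z-buffer has already been read into a NumPy array; the function name `depth_to_gray` and the parameters `z_black` and `z_white` (the depth values that should map to gray levels 0 and 255) are hypothetical, since the text does not say how the depth range was chosen.

```python
import numpy as np

def depth_to_gray(zbuf, z_black, z_white):
    """Linearly map depth values to gray levels in [0, 255] (illustrative only)."""
    z = np.asarray(zbuf, dtype=float)
    # Normalize so that z_black -> 0 and z_white -> 1, clipping values outside the range.
    t = np.clip((z - z_black) / (z_white - z_black), 0.0, 1.0)
    return np.rint(t * 255.0).astype(np.uint8)
```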

Since the implementations of the OpenGL library take advantage of hardware 
acceleration in most current computers (including PCs), the whole process turns out 
to be very fast. For example, the approximating image (with 262,144 pixels) corresponding
to the example used so far was obtained in 0.31 seconds on an SGI Indigo II
(Fig. 5, right). Other examples are shown in Fig. 6 and Fig. 7.

The approximation error (RMS) of the image shown in Fig. 5 (right) is 12.1. This 
RMS error can be bounded to a desired tolerance by applying a previously devel- 
oped algorithm [5], which approximates intensity images through bounded error 
triangular meshes. 
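
The text does not spell out the error measure. Assuming the usual root-mean-square difference between the original and the approximating gray levels, it could be computed as in this short sketch (the function name `rms_error` is hypothetical):

```python
import numpy as np

def rms_error(original, approximation):
    """Root-mean-square gray-level difference between two images of equal shape."""
    diff = np.asarray(original, dtype=float) - np.asarray(approximation, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))
```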

4 Geometric Processing of Triangular Meshes 

The triangular meshes obtained above are representations of intensity images at a 
higher level of abstraction. This allows the application of many image processing 
operations more efficiently than if they were applied upon the individual pixels of 
the original images. For example, translation, scaling, rotation and deformation 
operations are trivially implemented by applying affine transformations to the 3D 
coordinates of the points that constitute the meshes (see Fig. 6). Since those adaptive 
meshes contain a fraction of the original amount of pixels, these operations perform 
faster than their pixel-to-pixel counterparts. The actual results corresponding to the 
examples shown in Fig. 6 are presented in the next section. 
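
As an illustration of why these operations are cheap in the geometric domain, the sketch below applies an affine map to the vertex array only. It is a simplified example under stated assumptions: the names `transform_mesh` and `rotation` are hypothetical, vertices are stored as `(row, col, gray)` triples, and only the two image coordinates are transformed while the gray level is left untouched.

```python
import numpy as np

def rotation(theta):
    """2x2 rotation matrix for an angle theta in radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def transform_mesh(vertices, matrix, offset=(0.0, 0.0)):
    """Apply x' = A x + t to the image coordinates of every mesh vertex."""
    out = np.array(vertices, dtype=float)
    # Row-vector form of A x: multiply by the transpose of A.
    out[:, :2] = out[:, :2] @ np.asarray(matrix, dtype=float).T + np.asarray(offset, dtype=float)
    return out
```

The cost is proportional to the number of mesh points rather than to the number of pixels; the connectivity of the triangles is unchanged by the map.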

Another operation that can benefit from the previously obtained triangular 
meshes is image segmentation. Image segmentation algorithms based on split-and- 
merge techniques must iteratively process all the pixels of the input images, group- 
ing and ungrouping them according to some uniformity criteria. Alternatively, the 
triangles of an adaptive triangular mesh already capture some of those criteria. 
Therefore, triangles can be merged instead of pixels, leading to an overall speed-up. 
In order to implement such an image segmentation algorithm in the geometric 
domain, it is possible to apply a fast technique previously developed for segmenting 
range images [4]. This approach is more efficient than the one presented in [10] [16], 
since the latter generates an adaptive triangular mesh by applying an optimization- 
based split-and-merge algorithm that considers the pixels contained in every trian- 
gle. Conversely, the proposed technique does not require any costly optimization 
stages.

Fig. 6. Approximating images obtained by applying simple geometric operations. (top-left)
Translation. (top-right) Rotation. (bottom-left) Scaling. (bottom-right) Elliptical deformation.

These basic operations are just a few examples of how triangular meshes can 
help speed up conventional image processing. Many other computer vision and
image processing tasks can also be accelerated through geometric processing
of adaptive triangular meshes. This will be the subject of further research.

5 Experimental Results 

The proposed approximation algorithm has been tested with intensity images of dif- 
ferent size and also compared to both a uniform (non-adaptive) sampling technique 
and a mesh decimation technique based on iterative optimization [13]. A public 
implementation of the latter technique (Jade) has been utilized. 

The uniform sampling technique consists of choosing one pixel out of a pre- 
defined number of pixels along the rows and columns of the image. On the other 
hand, the optimization-based technique (Jade) starts with a high resolution triangular 
mesh containing all the pixels from the image, and decimates it until either a certain 
number of points is obtained or the approximating error is above a threshold. In
order to be able to compare these techniques with the proposed one, Jade and the
uniform sampling process were run to produce triangular meshes with the same
number of points as the ones obtained with the proposed technique.

Fig. 7. Approximating images. (left column) Uniform (non-adaptive) sampling. RMS errors:
14.3 (top) and 18.3 (bottom). (center column) Proposed technique. RMS errors: 9.8 (top) and
10.6 (bottom). (right column) Optimization-based technique (Jade). RMS errors: 11.7 (top)
and 15.3 (bottom).

Fig. 7 shows two of the test images. The triangular meshes generated from these 
images with the proposed technique were obtained in 7.64 (Lenna) and 1.26 (House) 
seconds. All CPU times were measured on an SGI Indigo II with a 175 MHz R10000
processor. These meshes contain 7,492 and 1,649 points respectively. The RMS 
errors of the approximating images are 9.8, Fig. 7 (top-center), and 10.6, Fig. 7 (bot- 
tom-center). If subsequent image processing operations were applied upon these 
triangular meshes, those points would be the only ones to be processed. The same 
operations applied upon the original images would require the processing of 262,144 
and 65,536 pixels respectively, which is between one and two orders of magnitude 
larger. 
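
As a quick check of the "between one and two orders of magnitude" statement, the pixel and point counts quoted above give the following reduction ratios (rounded):

```latex
\frac{262{,}144}{7{,}492} \approx 35.0, \qquad \frac{65{,}536}{1{,}649} \approx 39.7
```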

The proposed technique produced better image approximations than both the uniform
sampling technique and the optimization-based technique (Jade). For example,
given the same number of points, the proposed technique always produced lower
RMS errors (e.g., 9.8 and 10.6 in Fig. 7) than the uniform sampling technique (e.g.,
14.3 and 18.4 in Fig. 7) and the optimization-based technique (e.g., 11.7 and
15.3 in Fig. 7). The reason is that the proposed technique explicitly models the
discontinuities in the image, while optimization-based techniques, such as Jade, are
only able to keep those discontinuities by concentrating large numbers of points.
Moreover, the CPU times necessary to generate the adaptive triangular meshes
corresponding to the two previous examples were two orders of magnitude lower with
the proposed technique (e.g., 7.64 and 1.26 sec) than with Jade (e.g., 1,580 and 675
sec).

Finally, the CPU times to perform the basic operations shown in Fig. 6 were measured
and compared with the times to perform similar operations with a conventional,
publicly available image processing package (CVIPtools) [14]. The images of
Lenna shown in Fig. 6 are approximating images obtained from an adaptive triangular
mesh of 7,492 points computed with the proposed technique. This corresponds to
an RMS error of 9.84 with respect to the original image, which contains 512x512
(262,144) pixels. The CPU time to perform the translation with CVIPtools was 0.32
sec., while the same operation in the geometric domain took 0.00087 sec. The rotation
operation took 2.23 sec. with CVIPtools and 0.02 sec. with the proposed
technique. Finally, the scaling operation took 0.33 sec. with CVIPtools and 0.02 sec.
with the proposed technique in the 3D geometric domain. CVIPtools does not
include any routines for producing deformations such as the elliptical one shown in
Fig. 6. Therefore, such a deformation would have to be implemented with a user program
that accesses the given image pixel by pixel, with the consequent time penalty. Similarly,
the image deformations typically found in image processing packages such as Adobe
Photoshop are easily implementable in the 3D geometric domain by trivial mesh
deformations, requiring a fraction of the time utilized in the image domain.

In all the examples considered in this section, the given times do not include the 
mesh generation and image reconstruction stages. The reason is that these stages 
must only be applied once: to map the original image to the geometric domain and to 
map the resulting mesh back to image space. If many operations are performed 
(chained) in the geometric domain, the overhead of those two stages will become 
negligible. 

6 Conclusions and Future Lines 

This paper presents a technique for approximating intensity images with discontinu- 
ity-preserving adaptive triangular meshes without optimization, and explores the use 
of those meshes to accelerate conventional image processing operations. The adap- 
tive meshes generated with this algorithm are obtained faster than with optimization- 
based algorithms and, since image discontinuities are explicitly handled, the results 
are better than the ones obtained through both uniform (non-adaptive) sampling and 
optimization-based algorithms. The paper also explores the utilization of adaptive 
triangular meshes for accelerating image processing operations. Basic translation, 
rotation, scaling and deformation operations have been developed in the geometric 
domain and compared to a conventional image processing software, showing a faster 
performance. 

We are currently developing new algorithms for implementing conventional pro- 
cessing operations (e.g., feature extraction, image enhancement, pattern recognition) 
directly in the geometric domain. The application of this technique to color images 
will also be studied. 
Abstract. We introduce a 3D tracing method based on differential geometry
in Gaussian blurred images. The line point detection part of the
tracing method starts with calculation of the line direction from the
eigenvectors of the Hessian matrix. The sub-voxel center line position is
estimated from a second order Taylor approximation of the 2D intensity
profile perpendicular to the line. In curved line structures the method
turns out to be biased. We model the bias in center line position using
the first order Taylor expansion of the gradient in scale and position.
Based on this model we found that the bias in a torus with a generalized
line profile was proportional to $\sigma^2$. This result was applied in a procedure
to remove the bias and to measure the radius of curvature in a curved
line structure.

The line diameter is obtained using the theoretical scale dependencies of 
the 0-th and 2nd order Gaussian derivatives at the line center. 
Experiments on synthetic images reveal that the localization of the cen- 
terline is mainly affected by line curvature and is well predicted by our 
theoretical analysis. The diameter measurement is accurate for diameters 
as low as 4 voxels. Results in images from a confocal microscope show 
that the tracing method is able to trace in images highly corrupted with 
noise and clutter. The diameter measurement procedure turns out to be 
accurate and largely independent of the scale of observation. 



1 Introduction 

Quantitative analysis of curvilinear structures in images is of interest in various
research fields. In medicine and biology researchers need estimates of center line
and diameter of line-like structures like blood vessels or neuron dendrites for
diagnostic or scientific purposes [?]. In the technical sciences people are
interested in center line positions of lines in engineering drawings or automatic
detection of roads in aerial images [?, 12].

What all center line detection methods have in common is the need for a 
criterion for a certain position in the image to be part of the center line of the 
line structure. Methods differ in the definition of such a criterion and in the way 
the criterion is evaluated for a certain point in the image. 
In a first class of methods a line point is defined by local gray value maxima
relative to neighboring pixels or voxels [?]. Since no reference is made to
properties of line structures in an image this class will generate false hypotheses
of line points if noise is present. In these methods line points are limited to
discrete positions on the grid defined by the image.

A suitable criterion for a line point is to consider it to be part of a structure
whose length is much larger than its diameter [?]. This criterion can be
materialized within the framework of differential geometry [?].
In differential geometry the starting point is the evaluation of derivatives
of the gray value distribution in the image. The point at the centerline
of a line structure is characterized by a relatively high second order derivative
perpendicular to the line direction and a first order derivative which is zero [?].

In the facet model of Haralick [?] image derivatives in a 2D image are calculated
from a third order polynomial fit to the image data in a 5x5 neighborhood
of the image pixel to be evaluated. The second order derivatives are used to
estimate the line direction. Along the line perpendicular to the direction of the
line structure the position where the first derivative vanishes is estimated from
the polynomial. A pixel is declared a line point if this position is within the pixel
boundaries. A drawback of this method is that the polynomial fit is sensitive to
noise which may lead to poor estimates of the derivatives. A solution to this problem
is to calculate image derivatives by convolution of the image with Gaussian
derivative kernels [?]. Applying Gaussian kernels reduces the influence of noise
and ensures meaningful regularized first and second order derivatives even in
the case of plateau-like intensity profiles across the line. In [?] the recipe of
Haralick to materialize the line point criterion was adopted and implemented for
2D and 3D images using Gaussian derivatives up to order two. For straight line
structures this method is bias free due to the isotropy of the kernels used. However,
for a curved line structure a bias in center line position is found [?] which
increases with the scale of the Gaussian kernels. This problem is not addressed
within the framework of the differential geometrical approach.

Few line detection or tracing methods provide an estimate of the line width
[?]. In [?] the diameter is estimated from a fit of a function
describing the line profile. These methods suffer from the same noise sensitivity
as the facet model [?]. This problem can be avoided by using the scale dependency
of normalized second derivatives to estimate line diameter [?]. In [?] no
evaluation of the method is presented and the diameter is found by iteration
over the scale, which is computationally expensive. The diameter measurement
procedure presented in [?] fits a 3D B-spline to the vessel outer surface. It
produces accurate results but has the initialization problem inherent to all B-spline
approaches.

In this paper we present a 3D line tracer which uses the line point detection
method as presented in [?]. The shift in line center due to line curvature is
analyzed and a method to compensate for this shift is developed. Our approach
for the measurement of diameter is based on the scale dependency of Gaussian
derivatives in the image. In our diameter measurement procedure iteration over
scale is not necessary if the shape of the gray value profile across the line is
known. The performance of the line tracing as well as the diameter estimation
method is evaluated using both synthetic images and 3D images of biological
structures obtained with a confocal microscope [?].

2 Tracing of 3D Curvilinear Structures 

2.1 Tracing Procedure 

Our tracing procedure starts by selecting a discrete point $\mathbf{P}_d$ at position $(x, y, z)$
in the image close to a center line position. For this purpose we use a 3D
navigator/viewer as presented in [?]. At this position we calculate the Gaussian
derivatives up to order two.



Fig. 1. 3D line structure with local Eigenvectors t, n and m of the Hessian Matrix. The 
Eigenvector t with the smallest Eigenvalue in magnitude is in the local line direction. 

The derivatives are calculated by convolving the image $I(x,y,z)$ with the
appropriate Gaussian derivative kernels of width $\sigma$ [20]. The second order
derivatives are used to build up the Hessian matrix

$$H = \begin{pmatrix} I_{xx} & I_{xy} & I_{xz} \\ I_{yx} & I_{yy} & I_{yz} \\ I_{zx} & I_{zy} & I_{zz} \end{pmatrix} \tag{1}$$

From $H$ we calculate the 3 eigenvalues $\lambda_t$, $\lambda_n$, $\lambda_m$ and the corresponding
eigenvectors t, n and m. The eigenvectors (t, n and m) form an orthonormal
base for a local Cartesian coordinate system through the center of the voxel.
The vector t which is aligned to the line direction (see Fig. 1) is the eigenvector
with the smallest absolute eigenvalue $\lambda_t$.

In the plane perpendicular to the line direction the gray value distribution
can be approximated by a second order Taylor polynomial

$$I(\xi,\eta) = I + \mathbf{p}\cdot\nabla I + \tfrac{1}{2}\,\mathbf{p}\cdot H\cdot\mathbf{p} \tag{2}$$

In Eq. (2) $\mathbf{p}$ is a vector in the plane defined by n and m, i.e.

$$\mathbf{p} = \xi\,\mathbf{n} + \eta\,\mathbf{m} \tag{3}$$

$I$ and $\nabla I$ are the Gaussian blurred gray value and gradient vector at the current
voxel position. The center line position $\mathbf{P}_c$ relative to the voxel center is found
by setting the gradient of the local Taylor polynomial to zero [?]

$$\nabla I(\xi,\eta) = 0 \tag{4}$$

and solving $\eta$ and $\xi$ from the resulting linear equation. The sub-voxel center line
position $\mathbf{P}_s$ is calculated by

$$\mathbf{P}_s = \mathbf{P}_d + \mathbf{P}_c \tag{5}$$

In general $\mathbf{P}_s$ will not be within the boundaries of the current voxel. If that
is the case the line point estimation procedure will be carried out again at the
discrete voxel position closest to $\mathbf{P}_s$. This procedure will be repeated until $\mathbf{P}_s$ is
within the boundaries of the current voxel.

The tracing proceeds by taking a step from the estimated position $\mathbf{P}_s$ in the
t-direction and estimating a new position $\mathbf{P}_s$ as described above.
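
The line point estimation of Eqs. (2)-(5) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `subvoxel_offset` is hypothetical, and the gradient and Hessian at the voxel are assumed to have been computed beforehand with Gaussian derivative kernels.

```python
import numpy as np

def subvoxel_offset(grad, hess):
    """Return (t, p_c): the line direction and the center line offset from the voxel center.

    grad: length-3 Gaussian gradient at the voxel.
    hess: 3x3 symmetric Hessian at the voxel.
    """
    grad = np.asarray(grad, dtype=float)
    hess = np.asarray(hess, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(hess)
    order = np.argsort(np.abs(eigvals))
    # t has the smallest absolute eigenvalue; n and m span the perpendicular plane.
    t = eigvecs[:, order[0]]
    basis = eigvecs[:, order[1:]]              # 3x2 matrix with columns n and m
    # Gradient of the Taylor polynomial restricted to the plane, set to zero:
    # basis^T grad + (basis^T H basis) (xi, eta)^T = 0
    a = basis.T @ hess @ basis
    b = -basis.T @ grad
    xi_eta = np.linalg.solve(a, b)
    return t, basis @ xi_eta
```

Adding the returned offset to the discrete voxel position gives the sub-voxel estimate of Eq. (5); a caller would repeat the estimate at the nearest voxel whenever the offset leaves the current voxel.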

2.2 Curvature Induced Bias in Line Center Position 

From scale space theory it is known that the position of a critical point like an
extremum or saddle point in an image is dependent on the scale at which the
image is observed [?]. As a function of scale the critical point is moving
along a trajectory in the 3-dimensional space of the image. Since a line point
in a 3D image is a critical point (extremum) in the 2D sub space perpendicular
to the line direction [23] a comparable mechanism will make the positions of
the center line points shift as a function of scale. In straight lines we observed
no significant shift due to the symmetry of the line and the isotropic properties
of the differentiation kernels [?]. If the line is curved a bias from the true line
positions is to be expected.

Starting point in our analysis of the scale dependent position of center line
points in space is a first order Taylor expansion of the gradient. As shown in [22]
the Taylor expansion of the gradient in position and scale is given by

$$\nabla I(\mathbf{p}_0 + \delta\mathbf{p},\, s_0 + \delta s) = H(\mathbf{p}_0, s_0)\cdot\delta\mathbf{p} + \mathbf{w}(\mathbf{p}_0, s_0)\,\delta s \tag{6}$$

In (6) $\mathbf{p}$ is an arbitrary position in space, $s$ is the scale parameter ($s = \tfrac{1}{2}\sigma^2$)
and $\mathbf{p}_0$ a critical point at scale $s_0$. The matrix $H(\mathbf{p}_0, s_0)$ is the Hessian matrix
and $\mathbf{w}(\mathbf{p}_0, s_0) = \Delta(\nabla I(\mathbf{p}_0, s_0))$. Setting the left part of (6) to zero yields the
first order approximation relating the shift $\delta\mathbf{p}$ in position of the critical point to
a variation $\delta s$ in scale. In our analysis we choose $s_0 = 0$ and $\mathbf{p}_0$ at the center line
of the curvilinear structure. Using (6) we can calculate the shift in line center
position when the scale starts to increase.
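
Setting the left part of (6) to zero and solving for the shift gives the first order relation below. This is a sketch that assumes the Hessian is invertible on the subspace in which the shift is measured, which the text does not state explicitly:

```latex
H(\mathbf{p}_0, s_0)\,\delta\mathbf{p} + \mathbf{w}(\mathbf{p}_0, s_0)\,\delta s = 0
\quad\Longrightarrow\quad
\delta\mathbf{p} = -H(\mathbf{p}_0, s_0)^{-1}\,\mathbf{w}(\mathbf{p}_0, s_0)\,\delta s
```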

To investigate bias introduced in center line position due to line curvature
and scale we model a curved line structure by a toroidal object. The center of
mass of the torus is in the origin and the center line in the plane $z = 0$. The
radius of curvature is given by $R_t$ and $R$ is the radius of the intensity profile
across the line (Fig. 4).

Because of the cylinder symmetry of the toroidal object it is most convenient
to use cylindrical coordinates for the calculation of the scale dependent line
center shift. The coordinate transformation relating cylindrical coordinates to
Cartesian is given by

$$x = \rho\cos\theta, \qquad y = \rho\sin\theta, \qquad z = z \tag{7}$$

In a cylindrically symmetrical object like a torus all derivatives with respect to
$\theta$ vanish which considerably simplifies the analysis. Since we are interested in
the shift within the 2D sub space perpendicular to the line direction (i.e. at
constant $\theta$) $\delta\mathbf{p}$ is given by

$$\delta\mathbf{p} = \begin{pmatrix} \delta\rho\cos\theta \\ \delta\rho\sin\theta \\ \delta z \end{pmatrix} \tag{8}$$



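By writing the derivatives appearing in (6) into cylindrical coordinates and using
(8) for $\delta\mathbf{p}$ we arrive at two independent equations relating $\delta\rho$, $\delta z$ and $\delta s$

$$\begin{aligned}
\frac{\partial^2 I}{\partial\rho^2}\,\delta\rho + \frac{\partial^2 I}{\partial\rho\,\partial z}\,\delta z + \left(\frac{\partial^3 I}{\partial\rho^3} + \frac{1}{\rho}\frac{\partial^2 I}{\partial\rho^2} - \frac{1}{\rho^2}\frac{\partial I}{\partial\rho} + \frac{\partial^3 I}{\partial z^2\,\partial\rho}\right)\delta s &= 0 \\
\frac{\partial^2 I}{\partial\rho\,\partial z}\,\delta\rho + \frac{\partial^2 I}{\partial z^2}\,\delta z + \left(\frac{\partial^3 I}{\partial\rho^2\,\partial z} + \frac{1}{\rho}\frac{\partial^2 I}{\partial\rho\,\partial z} + \frac{\partial^3 I}{\partial z^3}\right)\delta s &= 0
\end{aligned} \tag{9}$$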



To derive the relationship between $\delta\rho$ and $\delta s$ from (9) we need
expressions for the partial derivatives with respect to $\rho$ and $z$ at the center line
of the original object ($s = 0$). These expressions are governed by the shape $I(r)$
of the line profile at zero scale. This intensity profile is considered to obey the
general conditions

$$I(r) = \begin{cases} I_0 f(r), & (r \le R) \\ 0, & (r > R) \end{cases} \tag{10}$$

In equation (10) $I_0$ is the gray value at the center line, $r = r(\rho, z)$ is the distance
from the center line position and $R$ the radius of the line structure. The first
derivative of $I(r)$ is assumed to vanish at the centerline.

Since we don't want to be limited to only one single line profile type we
perform the analysis for a more general profile type $f(r(\rho, z))$ which is given
by:



$$f(r(\rho,z)) = \ldots \tag{11}$$



with 



$$r(\rho,z) = \sqrt{(\rho - R_t)^2 + z^2} \tag{12}$$

Figure 2 shows that by changing $\alpha$ the shape of the profile can be adjusted. It
can be proven that the line profiles corresponding to the limiting cases $\alpha \to 0$
and $\alpha \to \infty$ are related to a parabolic and a pillbox profile respectively.




Fig. 2. Shape of the line profile for different values of $\alpha$. The shape of the profile
changes from parabolic for $\alpha \to 0$ to pillbox shaped for $\alpha \to \infty$.



Even for the general profile described by (11) most of the partial derivatives
appearing in (9) vanish for points at the center line ($\rho = R_t$ and $z = 0$). Only
$\partial^2 I/\partial\rho^2$ and $\partial^2 I/\partial z^2$ turn out to be nonzero yielding a simple relationship
between $\delta\rho$, $\delta z$ and $\delta s$

$$\delta\rho = -\frac{\delta s}{R_t} \tag{13}$$

$$\delta z = 0 \tag{14}$$

Equations (13) and (14) reveal that the shift $\Delta P_s = |\delta\mathbf{p}|$ is towards the center
of mass of the torus and is not dependent on $R$.

In practice the parameter of interest will be the shift normalized with respect
to the radius $R$ of the line structure. From (13) we can write the normalized shift
$\Delta P_s/R$ in terms of $R/R_t$ and $\sigma/R$:

$$\frac{\Delta P_s}{R} = \frac{1}{2}\left(\frac{\sigma}{R}\right)^2\left(\frac{R}{R_t}\right) \tag{15}$$
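
The step from (13) to (15) can be spelled out as follows. The sketch assumes that the scale parameter is $s = \tfrac{1}{2}\sigma^2$ and that the shift is accumulated from $s_0 = 0$, so that $\delta s = \tfrac{1}{2}\sigma^2$ within the first order approximation:

```latex
\Delta P_s = |\delta\rho| = \frac{\delta s}{R_t} = \frac{\sigma^2}{2R_t}
\quad\Longrightarrow\quad
\frac{\Delta P_s}{R} = \frac{\sigma^2}{2 R R_t}
= \frac{1}{2}\left(\frac{\sigma}{R}\right)^2\left(\frac{R}{R_t}\right)
```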



The same analysis for a Gaussian profile shows that (15) holds for this profile
type too. For a Gaussian profile $f(r)$ is defined by:

$$f(r(\rho,z)) = e^{-\frac{1}{2}\frac{(r(\rho,z))^2}{R^2}} \tag{16}$$

To evaluate the range of applicability of (15) we measured $\Delta P_s$ as a function
of scale in synthetic images of curved line structures. In each experiment 20
line center positions were obtained by tracing a toroidal object. In all cases
the variation in the measured $\Delta P_s$ is negligible compared to the average bias
introduced by the curvature. Values of $R_t/R$ of 3, 5 and 10 were chosen to
represent a highly, a moderately and a slightly curved line structure respectively.
For $R$ we chose the values 3, 5 and 10 to investigate the sensitivity of (15) to
the size of the object relative to the image grid. The profile shape parameter
$\alpha$ was varied between 0 and 10 in order to cover a large range of profile types
between a parabolic profile and a pillbox profile. The results of the localization
experiments show the same shift-scale relationship for all $\alpha$ both in shape and
magnitude. Figure 3 (left panel) shows as an example the results on a profile
with $\alpha = 5$. The same set of experiments on a torus with a Gaussian line profile
yields comparable results (Fig. 3, right panel).



2.3 Removal of Bias Due to Curvature 



In curved line structures the bias in the centerline position $\Delta P_s$ can be
of the order of several voxels. For these cases the availability of a method to
reduce this bias is desirable. The correction procedure which we propose starts
by calculating $\mathbf{P}_s$ at two scales $\sigma_1$ and $\sigma_2$ with corresponding centerline positions
$\mathbf{P}_{s1}$ and $\mathbf{P}_{s2}$ ($\sigma_2 > \sigma_1$). From $\mathbf{P}_{s1}$ and $\mathbf{P}_{s2}$ it is possible to calculate the normal
vector n which points towards the center of mass of the torus (Fig. 4).

$$\mathbf{n} = \frac{\mathbf{P}_{s2} - \mathbf{P}_{s1}}{|\mathbf{P}_{s2} - \mathbf{P}_{s1}|} \tag{17}$$



Equations (13) and (14) imply that n is pointing towards the center of mass of 
the torus in the plane defined by the centerline of the torus. 
Fig. 3. $\Delta P_s/R$ as a function of $\tfrac{1}{2}(\sigma/R)^2$ for a profile with shape parameter $\alpha = 5$
(left panel) and for a Gaussian profile (right panel). The solid straight line shows the
theoretical first order relationship between scale and center line shift.

Fig. 4. Torus with radius of curvature $R_t$ and radius $R$ of the intensity profile across
the line. Centerline positions $\mathbf{P}_{s1}$ and $\mathbf{P}_{s2}$ are obtained at scales $\sigma_1$ and $\sigma_2$ respectively.
$\mathbf{P}_s$ is the unbiased centerline position at $\sigma = 0$.
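Using equation (13) it is straightforward to show that the distance between
the two position vectors $\mathbf{P}_{s1}$ and $\mathbf{P}_{s2}$ is given by

$$|\mathbf{P}_{s2} - \mathbf{P}_{s1}| = \frac{\sigma_2^2 - \sigma_1^2}{2R_t} \tag{18}$$

The value of $R_t$ is found by solving (18) for $R_t$

$$R_t = \frac{\sigma_2^2 - \sigma_1^2}{2\,|\mathbf{P}_{s2} - \mathbf{P}_{s1}|} \tag{19}$$

The corrected $\mathbf{P}_s$ is found by subtracting the shift along n at scale $\sigma_i$ from the
estimated $\mathbf{P}_{si}$, i.e.

$$\mathbf{P}_s = \mathbf{P}_{si} - \frac{\sigma_i^2}{2R_t}\,\mathbf{n} \qquad (i = 1, 2) \tag{20}$$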



The correction method was evaluated using synthetic images of tori again. Figure 5
shows an example of the performance of the method ($\sigma/R = 1$). After
correction the relative bias $\Delta P_s/R$ is negligible for relative curvature $R_t/R > 3$
which is in correspondence with the theory (cf. Fig. 3).
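
The two-scale correction can be sketched as below. The code is an illustration under stated assumptions rather than the authors' implementation: the function name `correct_curvature_bias` is hypothetical, and the shift at scale $\sigma$ is taken to be $\sigma^2/(2R_t)$, which is how Eqs. (13), (15) and (19) are read here.

```python
import numpy as np

def correct_curvature_bias(ps1, ps2, sigma1, sigma2):
    """Estimate the radius of curvature and the unbiased center line position.

    ps1, ps2: center line positions traced at scales sigma1 < sigma2.
    Returns (ps, rt); rt is infinite when no shift is measured (straight line).
    """
    ps1 = np.asarray(ps1, dtype=float)
    ps2 = np.asarray(ps2, dtype=float)
    d = ps2 - ps1
    dist = np.linalg.norm(d)
    if dist == 0.0:
        return ps1.copy(), np.inf
    n = d / dist                                      # unit vector towards the torus center
    rt = (sigma2 ** 2 - sigma1 ** 2) / (2.0 * dist)   # radius of curvature
    ps = ps1 - (sigma1 ** 2 / (2.0 * rt)) * n         # remove the shift at scale sigma1
    return ps, rt
```

For example, a true center line point at (10, 0, 0) on a torus with radius of curvature 10 would be traced at (9.8, 0, 0) for scale 2 and at (9.55, 0, 0) for scale 3 under this model; the function recovers both the radius and the unbiased position from these two measurements.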




Fig. 5. Correction for bias introduced by curvature, plotted against $R_t/R$. Both the bias as a result of the
tracing procedure (dots, $\sigma_1/R = 1$) and the bias after correction (squares) are shown.
For $\sigma_2$ a value of $1.5\sigma_1$ was chosen.



2.4 Tracing in Confocal Images 

The applicability of the tracing method is illustrated in two different examples 
of 3D biological curvilinear structures (Fig. 6). The images were obtained using
a confocal microscope. The experiments reveal that the center line estimation
method converges to the optimal center line position as long as the scale of
the differentiation kernels is chosen larger than the radius $R$ of the intensity
profile across the line. Taking this constraint imposed on the scale into account
the method is capable of measuring center line positions even in the noisy image
of the neuron cell (Fig. 6, right image). In the image of the Spathiphyllum
pollen grain (Fig. 6, left image) the method allows for tracing highly curved
line segments. In these segments we observed the scale dependent line center
shift as predicted by our theoretical analysis (Fig. 6, central image).

Fig. 6. Tracing results in 3D images of biological specimen (left: a Spathiphyllum
pollen grain, right: a pyramidal neuron cell). The images were obtained using a confocal
microscope. The black dots represent the estimated center line positions. The central
image shows an enlarged part of the Spathiphyllum pollen grain containing a highly
curved line segment (…/R = 1).

3 Diameter Estimation of Pillbox Shaped and Parabolic 
Profiles 

3.1 Single Scale Diameter Measurement 

For diameter estimation it is necessary to take the shape of the 2D gray value
profile perpendicular to the line into account [?]. The profile is assumed to obey
the general condition as mentioned in (10).

We use the scale dependencies of $I(r)$ convolved with a Gaussian and the
second Gaussian derivatives of $I(r)$ at $r = 0$ to estimate the line diameter. For
this purpose expressions are derived for the Gaussian blurred intensity $I(R,\sigma)$
and the Laplacian $\Delta_\perp I(R,\sigma)$ restricted to the span of n and m:

$$I(R,\sigma) = \ldots \tag{21}$$

$$\Delta_\perp I(R,\sigma) = \ldots \tag{22}$$

In (21) and (22) $g(r,\sigma)$ and $g_{rr}(r,\sigma)$ are the 2D Gaussian and its second derivative
in r-direction. The expressions for $I(R,\sigma)$ and the Laplacian $\Delta_\perp I(R,\sigma)$
are used to construct a non-linear filter which is rotation invariant with respect
to the line direction and independent of $I_0$:



$$h(R,\sigma) = \frac{I(R,\sigma)}{-\tfrac{1}{2}\sigma^2\,\Delta_\perp I(R,\sigma)} \tag{23}$$

The denominator in (23) represents the 2D Laplacian based on normalized second
derivatives [?].

The theoretical filter output $h(R,\sigma)$ is dependent on the choice of $f(r)$. For
a parabolic and a pillbox profile the integrals appearing in (21) and (22) can be
evaluated analytically. For a pillbox profile we find

$$h(q) = \frac{1 - e^{-q}}{q\,e^{-q}} \tag{24}$$

and for a parabolic profile

$$h(q) = \frac{(1 - e^{-q}) - q}{q\,e^{-q} - (1 - e^{-q})} \tag{25}$$

with

$$q = \frac{1}{2}\left(\frac{R}{\sigma}\right)^2 \tag{26}$$



Equations (24-26) show that for the pillbox and the parabolic profile $h(R,\sigma)$ is
only dependent on the dimensionless parameter $q$. Figure 7 shows that in both cases $h(q)$ is
a monotonically increasing function of $q$. This property makes it easy to estimate
$q$ from a measured filter output $h_{meas}$. If $h_{meas}$ and the shape of the profile are
known $q$ can be estimated by solving one of the equations

$$h(q) - h_{meas} = 0 \tag{27}$$

By a simple bisection method the root $q_0$ of (27) is found. The corresponding
$R$ is found by solving (26), i.e.

$$R = \sigma\sqrt{2q_0} \tag{28}$$
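
The root finding step of (27) and the conversion of (28) could be sketched as follows. This is an illustration only: the function names and the bracketing interval `q_lo`, `q_hi` are hypothetical choices, and the filter responses are written as (24) and (25) are read here, i.e. $h(q) = (1 - e^{-q})/(q e^{-q})$ for the pillbox profile and $h(q) = ((1 - e^{-q}) - q)/(q e^{-q} - (1 - e^{-q}))$ for the parabolic profile.

```python
import math

def h_pillbox(q):
    """Theoretical filter output for a pillbox profile."""
    return (1.0 - math.exp(-q)) / (q * math.exp(-q))

def h_parabolic(q):
    """Theoretical filter output for a parabolic profile."""
    e = math.exp(-q)
    return ((1.0 - e) - q) / (q * e - (1.0 - e))

def estimate_radius(h_meas, sigma, h=h_pillbox, q_lo=1e-3, q_hi=50.0, tol=1e-10):
    """Solve h(q) = h_meas by bisection and return R = sigma * sqrt(2 * q0)."""
    if not h(q_lo) <= h_meas <= h(q_hi):
        raise ValueError("measured filter output lies outside the bracketed range")
    # h is monotonically increasing, so the root stays inside [q_lo, q_hi].
    while q_hi - q_lo > tol:
        mid = 0.5 * (q_lo + q_hi)
        if h(mid) < h_meas:
            q_lo = mid
        else:
            q_hi = mid
    q0 = 0.5 * (q_lo + q_hi)
    return sigma * math.sqrt(2.0 * q0)
```

Because the filter output depends on $R$ and $\sigma$ only through $q$, a single measurement at one scale is enough once the profile shape is fixed, which is the point made in the text about avoiding iteration over scale.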



3.2 Experiments on Diameter Measurement 

To evaluate the performance of the line diameter estimation method synthetic
images containing straight line segments with circular cross section were used.
The diameter of the line segment was varied between 2 and 15. Both a pillbox
shaped and a parabolic intensity profile were evaluated. The diameter estimate
turned out to be independent of the setting of $\sigma$ in the range where $0.2 < R/\sigma < 2$.
In the synthetic images the bias in the estimated diameter is always below 5 %
(see Fig. 8). The diameter measurement procedure was also tested on a biological
3D image from a confocal microscope containing a Spathiphyllum pollen grain
(Fig. 6, left panel). The diameter was measured at scales varying between 2.0
and 5.0. Within this range of scales the diameter measurement deviates only 6 %
when the parabolic profile is used as a model (Fig. 9). In case the pillbox profile
is used the diameter measurement fails.

Fig. 7. Theoretical line diameter filter output $h(q)$ for a pillbox profile (solid line) and
a parabolic profile (dashed line). Parameter $q$ is dimensionless and only dependent on
$R$ and $\sigma$ ($q = \tfrac{1}{2}(R/\sigma)^2$).

4 Discussion 

We presented methods for 3D center line detection and diameter measurement 
using regularized derivatives in the image. The scale dependency of the deriva- 
tives at or near center line positions proved to be useful for the measurement of 
diameter and for modeling and compensation of bias due to curvature. 

The theory developed in section 2.2 provides fundamental insight in the pa- 
rameters governing the shift in center line position due to line curvature. The 
absolute shift turns out to be proportional to and inversely proportional to 
$R_t$. The inverse proportionality to $R_t$ corresponds to the fact that for infinite
$R_t$ the line structure is straight, yielding a bias-free center line detection. The
proportionality to accounts for the fact that the localization of the center line 
will be best at small scales of observation. However, in practice it is not always 
possible to select a small scale due to noise or irregularities in the image. For 
these cases the compensation method is most useful. 

The simple shift-scale relationship expressed in (13) is valid for the family 
of line profiles given by (11). We believe that this family covers a large range 
of profiles encountered in 3D images. The validity of (13) can be extended by 
examining equation (9) more closely. From (9) we conclude that equation (13) 
holds for all line profiles for which the third order derivatives and the mixed 
second derivatives vanish at the center line. 

In the derivation of the bias removal method a procedure to measure local
line center curvature naturally appears. After measuring the sub-voxel center
line position $p_s$ at two different scales, equation (19) can be used to compute
$R_t$. It is advantageous to have a measure for curvature at center line positions
since it is not possible to use isophote curvature [3]. Although a center line can
locally be approximated by an isophote, the gradient-based isophote curvature
cannot be used because at the center line position the gradient is zero
and has an undefined direction.

Fig. 8. Relative bias $\Delta R/R$ in the estimation of the radius $R$ of the line structure as a
function of $R$.

Figures 3 and 5 show that the bias removal yields accurate results for values
of $\Delta p_s/R$ up to 0.2. This covers a large range of $R_t$, $R$, $\sigma$ combinations which
are of practical importance.

The experiments on diameter measurement confirm the value of our approach,
in which we use the theoretical scale dependencies of the derivatives at the
center line position. In synthetic images the diameter has negligible bias in line
structures with diameters as low as 4 voxels and is independent of $\sigma$ for all
scales. In biological images the scale independence is somewhat worse but still
good enough to be scale independent in practice (Fig. 9). Using an incorrect
line profile model might induce a bias in the measured diameter. In the extreme
case that a pillbox profile is used in the diameter measurement of a profile
which is actually parabolic, the bias is about 25 percent. This can be considered
the maximum possible bias in this method when no a priori information on the
profile shape is available.

The tracing method presented here is capable of measuring center line po- 
sition accurately even in images containing highly curved line segments. The 
line diameter measurement has proven to be accurate in biological images and 
largely scale independent. 
Abstract. Perceptual experiments indicate that corners and curvature
are very important features in the process of recognition. This paper
presents a new method to efficiently detect rotational symmetries, which
describe complex curvature such as corners, circles, star and spiral patterns.
The method is designed to give selective and sparse responses. It
works in three steps: first, extract local orientation from a gray-scale or
color image; second, correlate the orientation image with rotational symmetry
filters; and third, let the filter responses inhibit each other in order
to get more selective responses. The correlations can be made efficient
by separating the 2D filters into a small number of 1D filters.

These symmetries can serve as feature points at a high abstraction le- 
vel for use in hierarchical matching structures for 3D-estimation, object 
recognition, etc. 



1 Introduction 

Human vision seems to work in a hierarchical way in that we first extract low 
level features such as local orientation and color and then higher level features 
0. There also seem to exist lateral interactions between cells, perhaps to make 
them more selective. No one knows for sure what these high level features are 
but there are some indications that curvature, circles, spiral- and star patterns are
among them mu, US]. And indeed, perceptual experiments indicate that corners 
and curvature are very important features in the process of recognition and one 
can often recognize an object from its curvature alone m, HE!- They have a 
high degree of specificity and sparsity. As they are point features, they do not 
suffer from the aperture problem usually encountered for line and edge structures 
0. This paper describes a procedure to detect the features mentioned above. 
First a local orientation image is calculated. Second this image is correlated with 
a set of filters to detect complex curvature features. Finally a lateral inhibition 
procedure is used to make these filter responses more selective and sparse. These 
symmetries can serve as feature points at a high abstraction level for use in
hierarchical matching structures for object recognition and estimation of 3D
structure. One example is the detection of traffic circles, crossroads (star shapes)
and bends of the road (high curvature points) in aerial images, which is useful
in the navigation of autonomous aircraft. Another example, concerning autonomous
trucks, is described at the end of this paper.



2 Notations 

$x = (x_1, x_2)$ and $y = (y_1, y_2)$ denote Cartesian coordinate values, $r$ and $\varphi$ denote
polar coordinate values, and $u$ denotes Cartesian coordinates in the frequency domain.
Remaining boldface letters, z, a, b etc., denote real or complex functions.
z(x) means the value at pixel x and z means the matrix containing the function
values at all pixels.

Products d = ab and divisions e = a/b between matrices are exclusively defined
to be pointwise operators, i.e. d(x) = a(x)b(x), e(x) = a(x)/b(x).

Inner scalar product:    $\langle b, f \rangle = \sum_{y} b^{*}(y)\, f(y)$    (1)

Correlation:    $(b \star f)(x) = \sum_{y \in \mathbb{Z}^2} b^{*}(y)\, f(y - x)$    (2)

where $^{*}$ denotes the complex conjugate. $b \star f$ is then also an image (assumed cut to
the same size as f). (Do not confuse this with convolution, $b * f$, where the filter
b is mirrored instead of conjugated.)
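
As a small illustration of definitions (1) and (2), the following sketch evaluates both directly from their sums. It is not an efficient implementation; the function names, the zero extension of f outside its support and the array indexing convention (origin at index (0, 0)) are assumptions made for the example.

```python
import numpy as np


def inner(b, f):
    # Eq. (1): <b, f> = sum_y conj(b(y)) * f(y)
    return np.sum(np.conj(b) * f)


def correlate_at(b, f, x):
    """Eq. (2) at a single shift x: sum_y conj(b(y)) * f(y - x).

    f is treated as zero outside its support (an assumption).
    """
    total = 0j
    rows, cols = f.shape
    for y0 in range(b.shape[0]):
        for y1 in range(b.shape[1]):
            s0, s1 = y0 - x[0], y1 - x[1]
            if 0 <= s0 < rows and 0 <= s1 < cols:
                total += np.conj(b[y0, y1]) * f[s0, s1]
    return total
```

At zero shift the correlation reduces to the inner product, which is the property used for the filter responses in Sec. 4.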



3 Local Orientation 

The classical representation of local orientation is simply a 2D-vector pointing in 
the dominant direction with a magnitude that reflects the orientation dominance 
(e.g. the energy in the dominant direction). An example of this is the image 
gradient. 

A better representation is the double angle representation, where we have a vector
pointing in the double angle direction, i.e. if the orientation has the direction $\theta$
we represent it with a vector pointing in the $2\theta$-direction. Figure 1 illustrates the
idea. The magnitude still represents our confidence in the dominant orientation




Fig. 1. Local orientations (left) and corresponding double angle descriptors as vectors 
(middle) and as complex numbers (right) 
and that is sometimes referred to as a certainty measure (often denoted c). The
concept double angle representation was first mentioned in ini- It has at least 
two advantages: 

• We avoid ambiguities in the representation: It does not matter if we choose
to say that the orientation has the direction $\theta$ or, equivalently, $\theta + \pi$. In the
double angle representation both choices get the same descriptor angle $2\theta$
(modulo $2\pi$).

• Averaging the double angle orientation description field makes sense. One 
can argue that two orthogonal orientations should have maximally diffe- 
rent representations, e.g. vectors that point in opposite directions. This is 
for instance useful in color images and vector fields when we want to fuse 
descriptors from several channels into one description. 
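
The two properties above can be checked numerically. The following is a minimal sketch; the helper names `double_angle` and `mean_orientation` are introduced here for illustration only.

```python
import numpy as np


def double_angle(theta, certainty=1.0):
    # Orientation theta (radians) mapped to certainty * exp(i * 2 * theta).
    return certainty * np.exp(2j * np.asarray(theta, dtype=float))


def mean_orientation(thetas, certainties):
    """Average descriptors in the double angle domain.

    Returns the mean orientation (half the phase) and the mean magnitude.
    """
    z = np.mean(double_angle(thetas, np.asarray(certainties, dtype=float)))
    return 0.5 * np.angle(z), np.abs(z)


theta = np.deg2rad(30.0)
z_a = double_angle(theta)
z_b = double_angle(theta + np.pi)       # same orientation, same descriptor
z_c = double_angle(theta + np.pi / 2)   # orthogonal orientation, opposite descriptor
```

Averaging the orientations 10° and 170° in this representation gives an orientation near 0° (equivalently 180°), whereas a naive average of the two angles would give 90°.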

This descriptor will in this paper be denoted by a complex number z, where
$\hat{z} = z/|z|$ points out the dominant orientation and $c = |z|$ indicates the certainty
or confidence in the orientation identified.

A definition before we continue: 

Definition 1. A simple signal $I : \mathbb{R}^n \to \mathbb{R}$ is defined as $I(x) = f(x \cdot \hat{m})$, where
$f : \mathbb{R} \to \mathbb{R}$ and $\hat{m}$ is a fixed vector in $\mathbb{R}^n$.

Figure 2 shows some examples of simple and non-simple signals. The magnitude




Fig. 2. Examples of a simple signal (left) and non-simple signals (middle and right) 



of the double angle descriptor is usually chosen to be proportional to the dif- 
ference between the signal energy in the dominant direction and the energy in 
the orthogonal direction. In this case the double angle descriptor cannot distin- 
guish between a weak (low energy) simple signal and a strong non-simple signal 
(e.g. noise with a slightly dominant orientation). We will get a low magnitude 
c = |z| (indicating no dominant orientation) in both cases which sometimes may 
be undesirable. 

Another, more comprehensive, representation of local orientation is the concept
of tensors |E]. A 2D tensor is a 2 × 2 tensor (matrix) $T = \lambda_1 e_1 e_1^T + \lambda_2 e_2 e_2^T$,
where $\lambda_1$ denotes the energy in the dominant orientation direction $e_1$ and $\lambda_2$ the
energy in the orthogonal direction $e_2$ (by this definition we get $\lambda_1 \geq \lambda_2$, and
for a simple signal $\lambda_2 = 0$). From this tensor we can calculate a double angle
descriptor e.g. by
This detour (gray-level image → tensor → double angle descriptor) seems
unnecessary, but in this way we can control the selectivity for simple/non-simple
signals and the energy sensitivity separately (for instance, a weak simple signal
can now get a higher magnitude |z| than a strong non-simple signal). We can
also note that the magnitude lies in the interval [0, 1], with |z| = 1 indicating
high confidence and |z| = 0 low confidence in the dominant orientation.

There are different ways to compute the tensor in practice. One way is to look 
at the local energy in different directions using a set of quadrature filters and 
weigh them together in a proper way |S|. The filtering can be efficiently com- 
puted by use of sequential filter trees Pj. Another way to compute a tensor is 
to approximate (by weighted least squares) the local neighborhood $f(x)$ with a
polynomial of e.g. second degree

    $f(x) \approx x^T A x + b^T x + c$    (4)

and from the coefficients create a tensor description

    $T = A A^T + \gamma\, b b^T$    (5)

(where $\gamma$ is a non-negative weight factor between the linear and quadratic terms).
This local polynomial approximation can be done by means of convolutions 
which in turn can be made separable leading to a very efficient algorithm, see 
0. The second method is at present a little faster and is the one used in this 
paper. To compute all local orientations in 9 x 9 neighborhoods in a 256 x 256 
gray-level image takes in MATLAB a couple of seconds on a 299MHz SUN Ultra 
10 . 
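
The chain from polynomial coefficients to a double angle descriptor can be sketched as follows. The tensor construction follows (5). The mapping from tensor to descriptor, however, is an assumption: it uses the magnitude $(\lambda_1 - \lambda_2)/\lambda_1$, one common choice that lies in [0, 1], and is not necessarily the expression the authors use. Function names are hypothetical.

```python
import numpy as np


def tensor_from_polynomial(A, b, gamma):
    # Eq. (5): T = A A^T + gamma * b b^T
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float).reshape(2, 1)
    return A @ A.T + gamma * (b @ b.T)


def double_angle_from_tensor(T):
    """Double angle descriptor from a symmetric 2 x 2 tensor.

    Assumed form: z = (lam1 - lam2) / lam1 * exp(i * 2 * theta),
    where theta is the direction of the dominant eigenvector e1.
    """
    lam, vec = np.linalg.eigh(T)   # eigenvalues in ascending order
    lam2, lam1 = lam
    if lam1 <= 0:
        return 0j                  # no energy, no orientation
    e1 = vec[:, 1]
    theta = np.arctan2(e1[1], e1[0])
    return (lam1 - lam2) / lam1 * np.exp(2j * theta)
```

With this choice a simple signal ($\lambda_2 = 0$) gets magnitude 1 regardless of its energy, and an isotropic neighborhood ($\lambda_1 = \lambda_2$) gets magnitude 0. The sign ambiguity of the eigenvector does not matter because the angle is doubled.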

It is assumed from now on that we have access to orientation transform 
images in the double angle representation, denoted z. 



4 Rotational Symmetries 



4.1 Basics of Rotational Symmetries 

Figure 3 contains two image patterns together with their corresponding orientation
descriptions in double angle representation (the orientation description is
only a sketch; the certainty c is high in all areas where we have an approximately
simple signal and fairly high signal energy). The circle can be described in
the orientation transform as $z = c_{circle}\, e^{i2\varphi}$, where $c_{circle} = |z|$. Thus, if we want
to detect circles, we could simply correlate the orientation image with the filter

    $w(r, \varphi) = a(r)\, e^{i2\varphi}$ where, e.g., $a(r) = \begin{cases} 1 & \text{if } r < R \\ 0 & \text{otherwise} \end{cases}$    (6)

The correlation at x = 0 becomes

    $(w \star z_{circle})(0) = \langle w, z_{circle} \rangle = \langle a\, e^{i2\varphi}, c_{circle}\, e^{i2\varphi} \rangle = \langle a, c_{circle} \rangle$    (7)
Fig. 3. Two patterns, a circle and a star, together with their corresponding orientation
description in double angle representation



The correlation at x ≠ 0 will be complex but give magnitudes less than
$\langle a, c_{circle} \rangle$. It is interesting to note that the same filter can also be used to detect
star patterns. The phases of the double angle descriptions of a circle and a star
differ only by a constant $\pi$ (see Fig. 3). Therefore, if we correlate the circle filter
with an orientation image describing a star pattern, we get

    $(w \star z_{star})(0) = \langle w, z_{star} \rangle = \langle a\, e^{i2\varphi}, c_{star}\, e^{i(2\varphi + \pi)} \rangle = e^{i\pi} \langle a, c_{star} \rangle$    (8)

($c_{circle}$ and $c_{star}$ do not have to be equal).

Consequently the filter can detect, and distinguish, between a whole class of 
patterns. The phase of (w, z) represents the class membership and the magnitude 
indicates the confidence in this membership. 
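
A small numerical check of this class behavior, assuming idealized orientation images with uniform certainty (an assumption made for illustration, real orientation images are sparser) and the box applicability of (6):

```python
import numpy as np

N, R = 21, 8.0
ax = np.arange(N) - N // 2
x, y = np.meshgrid(ax, ax)
r = np.hypot(x, y)
phi = np.arctan2(y, x)

a = (r < R).astype(float)        # box applicability, Eq. (6)
w = a * np.exp(2j * phi)         # second order (circle) filter

c = np.ones_like(r)              # uniform certainty, illustrative only
z_circle = c * np.exp(2j * phi)
z_star = c * np.exp(1j * (2 * phi + np.pi))


def inner(b, f):
    # <b, f> = sum conj(b) * f
    return np.sum(np.conj(b) * f)


resp_circle = inner(w, z_circle)   # phase 0: circle
resp_star = inner(w, z_star)       # phase pi: star
```

Both responses have the same magnitude here, and only the phase separates the two class members.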

The filter described above is only one of many in a family of filters called 
rotational symmetry filters: 

n:th order rotational symmetry filter

    $w_n(r, \varphi) = a(r)\, b_n(\varphi)$ where $b_n(\varphi) = e^{in\varphi}$    (9)

a(r) is called the applicability function and can be thought of as a window for
the basis function $b_n$, which is called a circular harmonic function. The n:th order
filter detects the class of patterns that can be described by $c\, e^{i(n\varphi + \alpha)}$ (they are
said to have the angular modulation speed n). Each class member is represented
by a certain phase $\alpha$. One can also think of the filter responses as components
in a polar Fourier series of local areas in the orientation image.

The curious reader might wonder what these patterns look like. To answer this
one has to start with the orientation image $c\, e^{i(n\varphi + \alpha)}$ and go backwards to the
original image, I, from which the orientation image was calculated. This 'inverse'
is of course not unambiguous, but we get a hint by making the assumption that
the image gradient is parallel to the dominant orientation:

    $\nabla I = \pm |\nabla I|\, e^{i(n\varphi + \alpha)/2}$    (10)

(We have to divide the phase by two to get rid of the double angle representation.
The price is the direction ambiguity.) This will give a differential equation
that can be solved assuming polar separability. The solution (not derived here)
becomes:






cos(( I -!)(/?+ f) n^2 

(^^l-n/2g(l-n/2) tan(a/2)(^ n = 2 



( 11 ) 



One way to visualize this is to plot level curves (1 + cos(wI))/2 (idea from 
These are also a solution to the differential equation. Figure 4 contains
a sample of these psychedelic patterns.

Fig. 4. Gray-level patterns corresponding to some rotational symmetries

The most useful rotational symmetry filters (i.e. those matching the most common
visual patterns in our daily lives) are the

• zeroth order, n = 0: Detects lines (simply an averaging of the orientation
image).

• first order, n = 1 (also called parabolic symmetries, referring to the gray-level
patterns the filters are matched to): Detects curvature, corners and
line endings. This is not an optimal corner detector but it is rather robust
and will give a high response to a variety of corner angles. The direction of
the corner equals the filter output phase.

• second order, n = 2 (also called circular symmetries): Detects stars, spirals
and circular-like patterns.

The three basis functions $b_0$, $b_1$ and $b_2$ are shown in Fig. 5 below.

The rotational symmetry filters were invented around 1981 by Granlund and 
Knutsson and have been mentioned previously in literature Q, ^2]) 
Fig. 5. The three most useful circular harmonic functions, $b_n = e^{in\varphi}$, plotted in vector
representation



4.2 Choice of Applicability Function

The choice of applicability function a depends on the application. We can for
instance choose a Gaussian function or a table like the one in (6) above. These
have one drawback though. In a general image we usually have more than one
event (pattern) and we only want to detect one at a time in order to avoid
interferences. In the case of rotational symmetries we measure the angular variation.
This can differ from one radius to another. One could therefore argue that we
should look at one fixed radius at a time.

As an example, look at the bicycle wheel in Fig. 6. It consists of two events: a
star pattern at a small radius and a circle pattern at a larger radius. If we
calculate an orientation image of the wheel and apply the circular symmetry filter
$w_2 = a(r)\, e^{i2\varphi}$ with the leftmost applicability in Fig. 7, we will get a high filter
output at the center of the wheel with the phase $\pi$, representing a star-shaped
pattern. If we instead take the rightmost, ring-shaped applicability, which mainly
'sees' the circle, we get a high filter output with phase 0. If we on the other
hand use the middle applicability, which 'sees' both events, they will probably
cancel each other out somewhat, giving a low filter output with random phase.
Argued differently, one can also say that we are measuring the change in the
$\varphi$-dimension and should therefore hold the other dimension (the radius) fairly
constant.¹ From this we can conclude that the descriptor is dependent upon
scale, as in the description of most other properties. Thus an applicability
function localized around a certain radius seems to be a better choice in the case of
multi-events (multi-scales).

Fig. 6. Bicycle wheel

Fig. 7. Three applicabilities a(r)

What is the best shape of the applicability? We can for instance choose the
rightmost ring-shaped applicability in Fig. 7. However, a more fuzzy applicability like
the one in Fig. 8, giving smoother and better controlled responses, will probably
be a better choice. The applicability in this figure is a simple difference of
Gaussians. This applicability has the further advantage that it can be separated into
a few 1D filters, see Sec. 7.

Fig. 8. Cross-section of a smooth applicability localized around a center radius

¹ Also compare with differentiation: If you e.g. want to calculate $\partial I/\partial x$ with a filter, it is
probably a good idea to make the filter fairly narrow in the y-direction.


5 Normalization 

Look at the two star patterns in Fig. 9. If we make orientation images for these
two stars and correlate with the circular symmetry filter $w_2 = a(r)\, e^{i2\varphi}$, we will
in both cases get a high filter response with phase $\pi$. The response for the left
star will however be lower than the response for the right star simply because
we have less orientation information in the first case, i.e. $\langle a, c_{star1} \rangle < \langle a, c_{star2} \rangle$.
This may not be desirable. If we want to detect star shapes we do not care if
the star has eight or twenty points, if it is composed of thick or thin lines and
so on. Similar problems occur for other patterns.

This problem can be solved by normalization. We normalize the filter response
with the amount of information we have available, weighted with the applicability
(remember that the division is a pointwise operator):²

Normalization of filter responses

    $s_n = \frac{(c\hat{z}) \star (a b_n)}{c \star a}$    (12)

Fig. 9. Two star patterns, $|\langle w_2, z_{star1} \rangle| < |\langle w_2, z_{star2} \rangle|$

² Normalization can also be seen as a projection of $\hat{z}$ onto the basis function $b_n$ in a
Hilbert space with the (weighted) inner product $\langle b, \hat{z} \rangle_{ac} = \sum_k c_k a_k \hat{z}_k b_k^{*} = \langle a b, c \hat{z} \rangle$




$s_n(x)$ has the useful property $|s_n(x)| \leq 1$. If the orientation z is consistent with
the filter basis function $b_n$ we will get $|s_n| = 1$ independently of the amount
of orientation information (except in the degenerate case c = 0). If the total
orientation information is too low we cannot rely on the result $s_n$. In the
orientation image we had a certainty measure c indicating how well we can rely
on the orientation information z. We can use the same formalism for rotational
symmetry images and introduce a symmetry certainty $c_s$, indicating how well we can rely
on $s_n$. For example we can use

Symmetry certainty

    $c_s = m_{\alpha,\beta}\!\left( \frac{c \star a}{1 \star a} \right)$    (13)

where 1 should be interpreted as an image with the constant value one, i.e. $1 \star a$
equals the sum of all coefficients in a except near the image border, and m is a
mapping function serving as a fuzzy threshold, e.g.

    $m_{\alpha,\beta}(x) = \frac{((1 - \alpha)x)^{\beta}}{((1 - \alpha)x)^{\beta} + (\alpha(1 - x))^{\beta}}$    (14)

$\alpha$ determines the threshold level and $\beta$ the 'fuzziness'. Figure 10 illustrates the
mapping function for some different values of $\alpha$, $\beta$. In this paper, however, we will
use the simple case $\alpha = 0.5$ and $\beta = 1$, giving $m_{\alpha,\beta}(x) = x$ and $c_s = \frac{c \star a}{1 \star a}$, i.e. $c_s$
is a weighted average of the input certainty.

Fig. 10. The mapping function $m_{\alpha,\beta}$ for some values of $\alpha$, $\beta$

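
The normalization and the certainty can be illustrated at a single window position. The sketch below is a simplification under stated assumptions: it evaluates the response only at the window center, conjugates the basis function as in the weighted inner product of the footnote, uses a hypothetical Gaussian ring as applicability, and models "less orientation information" by random binary certainty masks.

```python
import numpy as np


def mapping(x, alpha=0.5, beta=1.0):
    # Fuzzy threshold m_{alpha,beta}(x) on [0, 1], Eq. (14).
    x = np.asarray(x, dtype=float)
    num = ((1.0 - alpha) * x) ** beta
    return num / (num + (alpha * (1.0 - x)) ** beta)


def normalized_response(z_hat, c, a, n, phi):
    # s_n at the window center: <a b_n, c z_hat> / <a, c>
    b_n = np.exp(1j * n * phi)
    denom = np.sum(a * c)
    if denom == 0:
        return 0j                      # degenerate case c = 0
    return np.sum(a * np.conj(b_n) * c * z_hat) / denom


def symmetry_certainty(c, a, alpha=0.5, beta=1.0):
    # c_s at the window center, Eq. (13), away from the image border.
    return mapping(np.sum(a * c) / np.sum(a), alpha, beta)


N = 41
ax = np.arange(N) - N // 2
x, y = np.meshgrid(ax, ax)
r, phi = np.hypot(x, y), np.arctan2(y, x)
a = np.exp(-((r - 10.0) / 4.0) ** 2)   # hypothetical ring-shaped applicability

z_star = np.exp(1j * (2 * phi + np.pi))  # unit-magnitude star orientation
rng = np.random.default_rng(0)
c_sparse = (rng.random((N, N)) < 0.1).astype(float)   # thin star
c_dense = (rng.random((N, N)) < 0.6).astype(float)    # thick star

raw_sparse = abs(np.sum(a * np.exp(-2j * phi) * c_sparse * z_star))
raw_dense = abs(np.sum(a * np.exp(-2j * phi) * c_dense * z_star))
s_sparse = normalized_response(z_star, c_sparse, a, 2, phi)
s_dense = normalized_response(z_star, c_dense, a, 2, phi)
```

The unnormalized magnitudes differ with the amount of orientation information, while both normalized responses have magnitude 1 and phase $\pi$. The difference in available information is instead carried by `symmetry_certainty`.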

because $\langle b_n, b_n \rangle_{ac} = \langle a, c \rangle$, where $\langle \cdot, \cdot \rangle$ is the normal scalar product defined in Sec. 2
above.

This is actually a special case of normalized convolution im> u where you 
project z on a subspace spanned by several basis functions (this requires knowledge
about dual basis theory).
6 Normalized Mutual Inhibition

Figures 11, 12 and 13 illustrate the procedure gray-level image → orientation
image → rotational symmetry images (the choice of applicability is not critical
in this case; the scale is about the same size as the circle). Figure 11 shows the
test image and corresponding orientation image. Figure 12 shows the normalized
responses $s_n$, n = 0, 1, 2 and their certainty $c_s$. In Fig. 13 the normalized
responses are weighted with the certainty. With this simple choice of $c_s$ we get

    $c_s s_n = \frac{(c\hat{z}) \star (a b_n)}{c \star a} \cdot \frac{c \star a}{1 \star a} = \frac{(c\hat{z}) \star (a b_n)}{1 \star a} = (c\hat{z}) \star (\hat{a} b_n)$    (15)

where $\hat{a} = \frac{a}{1 \star a}$ is a normalized applicability. This is the same as if we had filtered
without normalization (except for $\hat{a}$ instead of a). As we soon will see, it is still
useful to divide the result into the terms $s_n$ and $c_s$.

Fig. 11. Left: Gray-level test image; a rectangle and a circle. Right: Magnitude |z| = c of
the orientation image (white indicates high magnitude)

Fig. 12. Magnitude of normalized filter responses $s_n$, n = 0, 1, 2. Rightmost: Symmetry
certainty $c_s = \frac{c \star a}{1 \star a}$

Fig. 13. Magnitude of normalized filter responses $s_n$, n = 0, 1, 2 weighted with the
certainty $c_s$ (equivalent to $w_n \star z$)

Take one more look at Fig. 13. As you can see, the first order symmetry filter
$s_1$ detects corners and the second order filter $s_2$ detects circles, but they also give
some response to patterns other than those we might wish them to respond to. This is because
we do not have orientation information in the whole applicability window. For
instance, a line in the outer field of the applicability window might as well be a
piece of a circle if you do not have contradictory orientation information in the
rest of the window, a corner is almost the same as half a circle, etc.

We can make the filters more selective by letting the normalized responses inhibit
(or punish) each other (it is much easier to inhibit with the normalized responses
than with the unnormalized ones because we know that $|s_n| \leq 1$):

Normalized inhibition

    $\check{s}_n = s_n \prod_{k \neq n} (1 - m_k(|s_k|))$    (16)

where $m_k$ can be some fuzzy threshold function controlling the power in the
inhibition, e.g. the one in (14). If one filter output is high the others will get low.
This idea is inspired by lateral inhibition in biological systems and is a well
known concept, but it has never before been used to make rotational symmetries
more selective.
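
Equation (16) translates almost directly into code. The following is a minimal sketch in which the function name `inhibit` and the default identity mapping (no fuzzy threshold) are assumptions. The inputs may be scalars or equally shaped response images.

```python
import numpy as np


def inhibit(s, m=lambda t: t):
    """Normalized mutual inhibition, Eq. (16).

    s: sequence of normalized responses s_0, s_1, ... with |s_k| <= 1.
    m: mapping applied to the magnitudes (identity by default).
    """
    s = [np.asarray(v, dtype=complex) for v in s]
    out = []
    for n, s_n in enumerate(s):
        factor = np.ones(s_n.shape)
        for k, s_k in enumerate(s):
            if k != n:
                factor = factor * (1.0 - m(np.abs(s_k)))
        out.append(s_n * factor)
    return out
```

For example, with magnitudes 0.1, 0.9 and 0.5 for the zeroth, first and second order responses, the first order response is reduced to 0.9 · 0.9 · 0.5 = 0.405, while the two weaker responses are suppressed much more strongly.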

Figure 14 illustrates the result after inhibition ($m_k$ is ignored here). The inhibited
responses are weighted with the certainty $c_s$. The inhibited first and second
symmetry responses, $\check{s}_1$ and $\check{s}_2$, are the most interesting ones (but we still need
to calculate $s_0$ for use in the inhibition procedure).

Fig. 14. Inhibited rotational symmetry responses $\check{s}_1 c_s$ (left) and $\check{s}_2 c_s$ (right)



6.1 Comparison to Previous Methods 

A previous solution to achieve better selectivity is described in a patent from 1986
H3 and is used in a commercial image processing program by Context Vision³.
The patent describes a method termed consistency operation. The name refers
to the fact that the operator should only respond to signals that are consistent with the
signal model. The method is based on a combination of four correlations:

    $h_1 = w_n \star z$ ,  $h_2 = w_n \star c$ ,  $h_3 = a \star z$ ,  $h_4 = a \star c$    (17)

The filter results are combined into

    $h = \frac{h_4 h_1 - h_2 h_3}{h_4^{\gamma}}$    (18)

where the denominator is an energy normalization controlling the model versus
energy dependence of the algorithm. With $\gamma = 1$ the output magnitude varies
linearly with the magnitude of the input signal. Decreasing the value increases
the selectivity. The result from this operation is called divcons when n = 1 and
rotcons when n = 2.

³ http://www.contextvision.se/
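
For comparison, the combination step of the consistency operation can be sketched as a pointwise function of the four correlation results. The denominator $h_4^{\gamma}$ is an assumption based on the description of the energy normalization in the text, and the small `eps` guard against division by zero is an addition for the example.

```python
import numpy as np


def consistency(h1, h2, h3, h4, gamma=1.0, eps=1e-12):
    """Pointwise combination of the four correlations of Eq. (17).

    Assumed form: h = (h4 * h1 - h2 * h3) / h4**gamma.
    """
    h1, h2, h3, h4 = (np.asarray(v, dtype=complex) for v in (h1, h2, h3, h4))
    return (h4 * h1 - h2 * h3) / np.maximum(np.abs(h4), eps) ** gamma
```

The subtracted term $h_2 h_3$ removes the part of the response that can be explained by the local mean of z alone, which is consistent with the observation below that the operation behaves much like an inhibition by the zeroth order response.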

The result when applying this method (with $\gamma = 1$) on the orientation image
in Fig. 11 is shown in Fig. 15. The divcons result is fairly comparable
to $\check{s}_1 c_s$ in this case, but the rotcons result is much less selective than $\check{s}_2 c_s$. The
consistency operation behaves much like an inhibition with only the
zeroth order symmetry response (line patterns). This means for instance that
the corners are not inhibited from the second order response but are instead
detected as 'half circles'. In general, using normalized inhibition instead of the
consistency operation produces more selective and sparse responses.

Fig. 15. Increase of selectivity using the consistency operation. Left: divcons. Right: rotcons



7 Separable Filters 

On the subject of computational complexity we can note that we have to compute
four correlations in the normalized inhibition procedure:

    $(w_0 \star z)$ ,  $(w_1 \star z)$ ,  $(w_2 \star z)$  and  $(a \star |z|) = (w_0 \star |z|)$    (19)

The filters have to be large if we want to find large patterns, and the correlations
become computationally demanding. However, there is much redundancy in
the filters and they can be separated into a few 1D filters using SVD (Singular
Value Decomposition). A filter kernel w of size N × N can be decomposed into
three N × N matrices, $w = U S V^T = \sum_{k=1}^{N} \sigma_k u_k v_k^T$, where $u_k$ and $v_k$ are the
k:th columns in U and V respectively, $U U^T = V V^T = I$ and $S = \mathrm{diag}(\sigma_1, \sigma_2, \ldots)$ (where
$\sigma_1 \geq \sigma_2 \geq \ldots$). If only a few $\sigma_k$, say M values, are high, we can approximate the
filter kernel by M terms in the sum, $w \approx \tilde{w} = \sum_{k=1}^{M} \sigma_k u_k v_k^T$, which corresponds
to 2M 1D filters.



Example 1. Let N = 21 and use a difference of Gaussians as applicability (as in
Fig. 8):

    $a(r) = e^{-r^2/\sigma^2} - e^{-r^2/(\delta\sigma)^2}$ ,  $\sigma = r_0 \sqrt{\frac{1 - 1/\delta^2}{\ln \delta^2}}$    (20)

where $r_0$ is the center radius and $0 < \delta < 1$ is a user parameter controlling the
thickness of the applicability. If we choose $r_0 = 3$ and $\delta = 0.5$ and compute the
SVD for $w_0 = a(r)$, $w_1 = a(r)\, e^{i\varphi}$ and $w_2 = a(r)\, e^{i2\varphi}$ respectively, we get

    diag($S_0$) = (3.6124, 0.8469, 0.0000, 0.0000, 0.0000, ...)
    diag($S_1$) = (2.6191, 2.6191, 0.1535, 0.1535, 0.0177, ...)    (21)
    diag($S_2$) = (2.6199, 1.8552, 1.8552, 0.1402, 0.0056, ...)

It is thus sufficient to approximate $w_0$ with two terms (four 1D filters), $w_1$ with
two terms (four 1D filters) and $w_2$ with three terms (six 1D filters), with a
relative error of about 5% for both $w_1$ and $w_2$. The reader can verify that
if we had chosen a Gaussian applicability instead of a difference of Gaussians, we
would have to use more terms than above to get a good approximation (except
for $w_0$, which can be separated into only two 1D filters).
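
Example 1 can be reproduced approximately with a few lines of NumPy. This is a sketch under assumptions: the applicability is taken as $a(r) = e^{-r^2/\sigma^2} - e^{-r^2/(\delta\sigma)^2}$, which has its maximum at $r_0$ for the $\sigma$ given in (20), the grid is centered on the middle pixel, and the helper name `separable_approx` is hypothetical. Note that `numpy.linalg.svd` returns the conjugate transpose of V, which is what is needed for the complex kernels.

```python
import numpy as np

N, r0, delta = 21, 3.0, 0.5
ax = np.arange(N) - N // 2
x, y = np.meshgrid(ax, ax)
r, phi = np.hypot(x, y), np.arctan2(y, x)

# Difference of Gaussians peaking at r0 (assumed form, see lead-in).
sigma = r0 * np.sqrt((1.0 - 1.0 / delta**2) / np.log(delta**2))
a = np.exp(-r**2 / sigma**2) - np.exp(-r**2 / (delta * sigma) ** 2)


def separable_approx(w, M):
    """Rank-M approximation of kernel w, i.e. 2M one-dimensional filters."""
    U, s, Vh = np.linalg.svd(w)
    w_M = (U[:, :M] * s[:M]) @ Vh[:M, :]
    return w_M, s


for n, M in [(0, 2), (1, 2), (2, 3)]:
    w = a * np.exp(1j * n * phi)
    w_M, s = separable_approx(w, M)
    rel_err = np.linalg.norm(w - w_M) / np.linalg.norm(w)
    print(f"n={n}: leading singular values {np.round(s[:5], 4)}, "
          f"relative error with {M} terms: {rel_err:.3f}")
```

Since each Gaussian is separable, $w_0$ is a sum of two rank-one matrices and its two-term approximation is exact up to rounding. Under these assumptions its two leading singular values come out at about 3.612 and 0.847, in line with the first row of (21).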



8 Experiments 

This section illustrates how the symmetry filters can be used with two examples.
Space limitations do not allow any detailed explanations.



8.1 Experiment 1: Autonomous Truck 

Figure 16 shows an image of a pallet together with the magnitude (orientation
certainty) of its orientation image. This is one type of image used in a robotics
project, where the goal is to have an autonomous truck locate the pallet and
pick it up. The orientation image was down-sampled before correlation with the
symmetry filters (using a Gaussian filter with std = 1.2). The inhibited first and
second order symmetries are shown in Fig. 17. The size of the filter kernel is
50 × 50 with a center radius of 6 (the separable filter technique described above
was used). The intensity represents the magnitude and the vectors represent the
value at the largest maxima points. There is a large number of responses and
the output looks complex, which is due to the fact that the input images are
indeed complex.

The second order symmetry filter gives characteristic blobs with zero phase on
the pallet, indicating circular-like patterns (in this case squares). The circle
patterns can be extracted from the second order response by e.g. taking the real
part and ignoring negative values, see Fig. 18. The pallet has a characteristic
signature of three large blobs and some smaller ones in between, see Fig. 19. The
whole procedure from the gray-level image to the circle image took about 6 seconds
on a 299MHz SUN Ultra 10. The circle image can be further processed
using Hough transform to find ’lines’ formed by the blobs and the correct line 
(the one in Fig. E) can be found by testing against the characteristic signature 
hypothesis. In spite of the complexity of the image, the linear set of circular 
structures turns out to be quite unique. When the pallet is found in the image 
it is possible to calculate pallet direction, distance and orientation in the real 
world using the information in the symmetry image and the local orientation 
image. 
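
The extraction of circle-like patterns mentioned above (real part, negative values ignored) amounts to a one-line operation. The sketch below assumes that the second order response and its certainty are available as arrays. The function name is hypothetical.

```python
import numpy as np


def circle_strength(s2, c_s):
    """Keep the part of the weighted second order response with phase near 0.

    Responses with phase near pi (star-like) have a negative real part
    and are set to zero.
    """
    weighted = np.asarray(s2, dtype=complex) * np.asarray(c_s, dtype=float)
    return np.maximum(np.real(weighted), 0.0)
```

A response with phase 0 passes unchanged, one with phase $\pi$ is removed, and intermediate phases are attenuated by the cosine of the phase.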

This technique has been tested on 28 images containing pallets. The distance
between the camera and the pallet varied in the range of 1 to 16 meters (the
symmetries were detected in several scales) and the orientation of the pallet
relative to the camera varied between 0 and 40 degrees. The result was promising
but the number of test images is too small to reach a final conclusion. Details 
about this project can be found in |2|. 

8.2 Experiment 2: Aerial Image 

Another example is the generation of features for use in the navigation of autonomous
aircraft. The features can be used in template matching or local histogram matching
to help the aircraft find landmark objects to establish its position. Figure
20 contains an aerial image and its orientation magnitude. The inhibited first
order symmetry is shown in Fig. 21 (as before, the intensity represents the
magnitude and the vectors represent the complex value at the largest maxima
points). The first order symmetry detects the corners in the crossroad and
curvature in general. The second order symmetry can be used to find traffic circles,
crossroads, houses, and other circular-like patterns. Detected circular-like
patterns (cf. Fig. 18) in two different scales are shown in Fig. 22. The rotational
symmetry features are more robust to illumination and seasonal variations than
the original gray-level image, which makes them suitable in a matching process.
In spite of the apparent complexity, the particular pattern of such features turns
out to be quite specific. Such a higher level matching is likely to take place in
human vision as well. This experiment is part of the WITAS-project, see Q, |H|. 
Future work includes multiscale detection and testing of robustness and invari- 
ance to viewpoint, seasonal variances, illumination etc. 

Fig. 16. An image of a pallet (left) and its local orientation magnitude (right)

Fig. 17. Inhibited rotational symmetry responses. Left: first order, Right: second order

Fig. 18. Circular-like patterns extracted from second order symmetry. The circle features form a line
at the pallet position

Fig. 19. Characteristics along a line in Fig. 18 fulfilling the structural hypothesis for the desired object

Fig. 22. Circular-like patterns extracted from second order symmetry using center
radius $r_0 = 5$ (left) and $r_0 = 15$ (right)

Abstract. This paper extends the recovery of structure and motion to image sequences
with several independently moving objects. The motion, structure, and
camera calibration are all a-priori unknown. The fundamental constraint that we
introduce is that multiple motions must share the same camera parameters. Existing
work on independent motions has not employed this constraint, and therefore
has not gained over independent static-scene reconstructions.

We show how this constraint leads to several new results in structure and motion 
recovery, where Euclidean reconstruction becomes possible in the multibody case, 
when it was underconstrained for a static scene. We show how to combine motions 
of high-relief, low-relief and planar objects. Additionally we show that structure 
and motion can be recovered from just 4 points in the uncalibrated, fixed camera, 
case. 

Experiments on real and synthetic imagery demonstrate the validity of the theory 
and the improvement in accuracy obtained using multibody analysis. 



1 Introduction 

This paper addresses the recovery of structure and motion from image sequences where 
the scene consists of several independently moving objects, and the motion, structure, 
and camera calibration are a-priori unknown. Figure 1 shows an example sequence.
We show that by simultaneously analyzing multiple motions, more information can be 
recovered than is possible from each motion alone. The new constraint that we introduce 
is that although the motions are independent from frame to frame of the image sequence, 
the same camera views the 3D scene in each case. This means that in a Euclidean 
reconstruction the camera internal parameters are common across objects for each frame, 
and the absolute conic and the plane at infinity are common to the reconstructed objects. 
The shared camera parameters constrain the possible reconstructions of the moving 
scene. 

The recovery of 3D from 2D up to now has depended on the presence of a single 
motion, often expressed as the “static-scene” assumption. The advantage that might be 
gained by multiple motions has been somewhat overlooked. Instead multiple motions 
have often been an irritant, and have at best been treated as a segmentation problem. In 
fact, multiple motions can contribute in a number of ways to solving the SFM problem. 
To give two examples: 

D. Vernon (Ed.): ECCV 2000, LNCS 1842, pp. 891-TO^ 2000.

© Springer-Verlag Berlin Heidelberg 2000

Fig. 1. Frames from an image sequence containing independent motions. The cake tin is revolving
as the camera moves through the scene. Simultaneous analysis of the motions improves
reconstruction.



I. Two views: In the case of two views of a static scene it is well known that the 
Kruppa equations 0 provide two constraints on the camera internal parameters. If all 
the parameters apart from the focal length are known, then the focal lengths of the 
two views can be computed from the fundamental matrix mnm . Now suppose that
in addition to the moving camera there is an object moving independently. Then both 
the object and the background scene each provide two Kruppa equations from their 
associated fundamental matrix. This means that there are four, in general, independent 
equations available on the internal parameters. Then, for instance, both the focal lengths 
and the principal point (assumed fixed but unknown) may be computed if the aspect ratio 
and skew are known. 

II. Four views: Suppose that the background scene and object are tracked through four 
or more views so that camera projection matrices can be computed for each. Then (as will 
be demonstrated here) there are sufficient constraints to solve for the (scaled) Euclidean 
motion of the camera and object and background scene and internal parameters, with 
the camera having different internal parameters in each view. 

Unlike previous approaches to auto-calibration, in the case of multiple motions with 
a sufficient number of views it is not necessary to make any assumptions about the 
internal parameters (e.g. that they are fixed) or camera motion - though of course such 
information may be included if available. 

The advantages of multibody 3D reconstruction are most vividly demonstrated in 
cases such as those above, where static-scene reconstruction cannot give an unambiguous 
solution. However, even in situations where the static-scene formulation is applicable, we 
show that multibody reconstruction yields better results in the presence of measurement 
error. In the simplest instance, this is because more measurements contribute to the 
estimate of camera parameters, so that the estimation of the entire reconstruction is 
improved. In the large, however, we show too that when the motion estimate is poorly 
constrained by each of the individual objects, the composite may be better than any 
component, by virtue of the independent contributions of each moving object. 

1.1 Background 

Previous work on multiple motions has concentrated on the problem of motion segmen- 
tation. For example, independent motions can be segmented and clustered using robust 
estimation techniques , which successively search for the most consistent static
scene correspondences. Throughout this paper, the motion segmentation problem
is assumed solved, and indeed motion segmentations herein have all been performed
automatically using the standard principles exemplified in Where authors have 
extracted motion after segmentation Rim . the motions have been estimated indepen- 
dently, or have depended on calibrated cameras. Costeira and Kanade Q do not assume 
prior segmentation, and provide a factorization method which recovers independent mo- 
tions, and provides a motion segmentation, when the imaging conditions allow the weak 
perspective assumption. Again, however, the coupling provided by the shared camera 
constraint is not present, so no accuracy can be gained over independent static-scene 
reconstructions. 

Avidan and Shashua have recently addressed a problem that they call “trajectory 
triangulation”: the recovery of the position of a point moving on an unknown straight 
line Q or an unknown conic where the point is seen by a single camera. This is 
a more general problem than classical triangulation because the point is not observed 
simultaneously in the views. In their case the camera motion (matrix) is known a-priori 
and the point motion restricted. This makes the problem quite different to that studied 
here. 



1.2 Camera Models 

Depending on the source of one’s imagery, one of a number of alternative camera models 
will pertain. The models differ in the number of camera parameters which are known 
a-priori, the number that are constant through the image sequence, and the number that 
are varying through the sequence, and so must be estimated separately for every frame. 

Calibrated: In the calibrated case, all camera parameters are known beforehand, and 
the task of 3D reconstruction is to determine the exterior orientation of the camera. 

Fixed: In the model considered by much early research into uncalibrated structure and
motion, the camera parameters are unknown, but assumed fixed throughout the sequence.
In practice for video sequences from a camcorder, aspect ratio and skew are very often
fixed (and can be known a-priori).

Free: It is not uncommon for a real camera to vary three or more of its internal parameters 
in every view. For example, a “pan and scan” conversion from widescreen to video 
format will continually change principal point and occasionally aspect ratio, while focal 
length is being varied in the original frame. A collection of scanned photographs will 
typically have arbitrary principal points and focal lengths. Even for a camera which 
is ostensibly varying zoom only, mechanical misalignments will invariably induce a 
peripatetic principal point OH . In these situations, static-scene 3D reconstruction cannot 
recover the internal parameters without scene knowledge. 

Nonlinear lens distortion: In these models, and in this paper, we ignore the issue of 
correction for radial lens distortion, assuming that a linear projection approximation is 
sufficiently accurate to initialize a bundle adjustment which does include radial correction 
terms. 

1.3 Notation

A standard pinhole camera model is employed, and points are represented in homogeneous
coordinates. The 2D projection $x$ of a 3D point $X$ is given by

$$ x = K [R \mid t] X ; \qquad K = \begin{bmatrix} f & s & u_0 \\ 0 & \alpha f & v_0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (1) $$

where for homogeneous quantities = indicates collinearity of vectors, or equality up to
a scale factor. The 3x3 rotation matrix $R$ and translation $t$ represent the position of the
camera in space. Alternatively, we can consider the camera to be fixed in position, so
that each object’s position is specified by a rotation and translation with respect to the
first view.

The image of the absolute conic (IAC) is given by $\omega = (K K^\top)^{-1}$, and its dual (DIAC)
is $\omega^* = K K^\top$.

The image data is a sequence of $v$ views of the scene containing $b$ rigid bodies
moving independently. In each view $v$, the camera parameters are given by $K_v$. The
projection equation for point $p$ of body $\beta$ in view $v$ is

$$ {}^\beta x_{vp} = K_v [\, {}^\beta R_v \mid {}^\beta t_v \,] \, {}^\beta X_p $$

The objective of this work is to recover the structure and motion parameters which best
model the observed image data $\{ {}^\beta x_{vp} \}$.

2 Variations on Multiple Motions

In the following sections, we present a number of multiple-motion scenarios and show
how the multibody analysis provides solutions which are not attainable in the static-scene
formulation. The scenarios considered are chosen in order to illustrate the various
types of advantage that multibody analysis confers. They are:

§2.1 In the most general case, there are several independently moving bodies, and each
body $\beta$ contains sufficient points to enable a consistent set of camera matrices
to be computed throughout the sequence. This requires that at least 7 points are
available (in general position) on each object for 2 views, and at least 6 points for 3
or more views.

§2.2 At the other extreme are special cases of: scene structure, object motion, or imaging
conditions. For example, the scene may consist of several independently moving
planes, or the object might move under planar motion (fixed rotation axis direction
and translation perpendicular to this direction), or the imaging conditions might be
affine.

§2.3 Between these general and specialized extremes are objects which offer only shallow
relief, and therefore poorly constrain the camera matrix estimation. However,
by simultaneously fitting to several motions, the spaces of degeneracy intersect,
constraining correctly the final computed motion.

§2.4 At the other end of the gamut from §2.1 we have the situation where very few
points can be tracked on each object - insufficient to compute camera matrices, but
the objects are not planar. We show that four or more points on a moving object
constrain the camera calibration and Euclidean structure and motion recovery when
tracked over multiple views.

These scenarios are now developed in detail. 

2.1 Several High-Relief Bodies 

In this scenario, we are tracking $b$ independently moving bodies. We will start with the
special case of two views. A fundamental matrix is computed for each body $\beta$. From
the general formula for the fundamental matrix [13]

$$ F = [e_2]_\times K_2 R K_1^{-1} $$

cancel $R R^\top = I$ in $(F K_1)(F K_1)^\top$ to obtain the 2-camera Kruppa [29] equations

$$ [e_2]_\times \, \omega_2^* \, [e_2]_\times = F \omega_1^* F^\top $$

where $e_2$ is the epipole, given by $e_2^\top F = 0$. This homogeneous equation is linear in the
elements of the DIACs $\omega_1^*$ and $\omega_2^*$, and generates two independent quadratic constraints
relating the elements of $\omega_1^*$ and $\omega_2^*$ on removing the scale factor. Thus it provides two
independent constraints on the 10 degrees of freedom (dof) of the two IACs. From just
two views then, each independent motion provides 2 constraints on the two cameras.
This leads to

Result 1 (Calibration from multiple objects: 5 objects, 2 views). If two cameras 
view a scene containing five or more independently moving objects, the cameras for the 
two views can be completely calibrated. 
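
The Kruppa relation used above can be checked numerically. The following is a minimal sketch, assuming NumPy and arbitrary illustrative camera values (the helper names `skew`, `calibration` and `rotation` are hypothetical and not from the paper). It builds $F = [e_2]_\times K_2 R K_1^{-1}$ and confirms that $F \omega_1^* F^\top$ agrees with $[e_2]_\times \omega_2^* [e_2]_\times^\top$; because $[e_2]_\times$ is skew-symmetric, the transpose only changes the sign, which is absorbed by the homogeneous scale.

```python
import numpy as np

def skew(e):
    """Cross-product matrix [e]_x, so that skew(e) @ y equals np.cross(e, y)."""
    return np.array([[0.0, -e[2], e[1]],
                     [e[2], 0.0, -e[0]],
                     [-e[1], e[0], 0.0]])

def calibration(f, u0, v0, alpha=1.0, s=0.0):
    """Calibration matrix K of the pinhole model (illustrative values only)."""
    return np.array([[f, s, u0],
                     [0.0, alpha * f, v0],
                     [0.0, 0.0, 1.0]])

def rotation(ax, az):
    """Rotation about the x axis by ax, followed by rotation about z by az."""
    cx, sx, cz, sz = np.cos(ax), np.sin(ax), np.cos(az), np.sin(az)
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    return Rz @ Rx

# Two views with different, arbitrarily chosen internal parameters.
K1 = calibration(f=800.0, u0=320.0, v0=240.0)
K2 = calibration(f=1000.0, u0=300.0, v0=260.0, alpha=1.1)
R = rotation(0.2, -0.3)
t = np.array([0.5, -0.1, 0.2])

e2 = K2 @ t                                  # epipole in the second view
F = skew(e2) @ K2 @ R @ np.linalg.inv(K1)    # F = [e2]_x K2 R K1^{-1}

diac1 = K1 @ K1.T                            # omega_1^*
diac2 = K2 @ K2.T                            # omega_2^*

lhs = skew(e2) @ diac2 @ skew(e2).T
rhs = F @ diac1 @ F.T
kruppa_residual = np.abs(lhs - rhs).max() / np.abs(lhs).max()
epipole_residual = np.abs(e2 @ F).max() / (np.linalg.norm(e2) * np.linalg.norm(F))
print(kruppa_residual, epipole_residual)
```

Both residuals should be at the level of floating-point round-off. Each independently moving object contributes one such relation, hence two constraints, so five objects match the 10 degrees of freedom of the two IACs.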

Multiple views: v > 2 It is assumed that each independently moving body supports 
the computation of a sequence of P matrices. We will now extend the counting argu- 
ments and computation methods that have been developed in the recent auto-calibration 
literature iir^nxr'Ti for single object motion. 

Counting argument. We choose to picture the camera as static, with objects moving
relative to the camera. For each object the unknowns are the Euclidean motion up to
an overall scale. Thus for $v$ views there are $6v - 7$ unknown parameters. However, in
$v$ views there are $11v - 15$ constraints available from the P matrices. Thus the number
of available constraints that may be applied to calibration (in principle) is $5v - 8$. This
results in 2 constraints from the fundamental matrix (the Kruppa equations), 7 from the
trifocal tensor, and 12 constraints from the quadrifocal tensor.

Now suppose there are $b$ objects. Since the objects are imaged by the same camera
their calibration constraints (assumed independent) may be combined. Consequently,
there are then $b(5v - 8)$ constraints available on a maximum of $5v$ unknowns (5 calibration
parameters per view). Thus provided $b(5v - 8) \geq 5v$, the calibration can be
determined in principle. This counting argument leads to the following result:

Result 2 (Calibration from multiple objects: 2 objects, 4 views). Given four or more
views of two moving objects the camera for each view can be completely calibrated.
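
The counting argument can be tabulated directly. The sketch below is only bookkeeping under the stated assumption that the constraints from different bodies are independent; the function names are illustrative.

```python
def calibration_constraints(bodies, views):
    """Constraints available for calibration: b(5v - 8)."""
    return bodies * (5 * views - 8)

def calibration_unknowns(views):
    """Five internal parameters per view when nothing is assumed fixed."""
    return 5 * views

def min_views(bodies, max_views=50):
    """Smallest v with b(5v - 8) >= 5v, or None if no v up to max_views works."""
    for v in range(2, max_views + 1):
        if calibration_constraints(bodies, v) >= calibration_unknowns(v):
            return v
    return None

# Constraints from one body in 2, 3 and 4 views.
per_body = [calibration_constraints(1, v) for v in (2, 3, 4)]
print(per_body)
for b in (1, 2, 3, 5):
    print(b, min_views(b))
```

A single body never suffices for a freely varying camera, two bodies need four views (Result 2), and five bodies need only two views (Result 1).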

Algebra. Now we develop the multiple-view constraints algebraically. We let

$$ {}^\beta P_i = [\, {}^\beta A_i \mid {}^\beta a_i \,] $$

denote the camera matrix of object $\beta$ in the $i$th view. For the first view we assume that
${}^\beta P_1 = [I \mid 0]$ for all $\beta$. This fixes the reconstruction frame of each object to coincide with
the frame of the first camera, an idea used by Sinclair [21]. For each object a Euclidean
reconstruction (up to an overall scale) is then obtained from the projective reconstruction
as

$$ {}^\beta P_i \, {}^\beta H = K_i [\, {}^\beta R_i \mid {}^\beta t_i \,], \qquad i = 2, \ldots, v \qquad (2) $$

where ${}^\beta H$ is the 4x4 homography

$$ {}^\beta H = \begin{bmatrix} K_1 & 0 \\ {}^\beta v^\top & 1 \end{bmatrix} \qquad (3) $$

which must be determined. Note that $K_1$ is common across objects; only the plane at
infinity, specified by ${}^\beta v$, need be determined independently for each object in the
projective frame specified by $\{ {}^\beta P_i \}$. Thus in order to determine the calibration matrix for
each view (and the Euclidean motion for each object) it is only necessary to solve for
$5 + 3b$ unknowns; namely the 5 internal parameters of the first camera, $K_1$, and 3 parameters,
${}^\beta v$, to specify the plane at infinity for each object (and thereby transform all the
reconstructions to a common metric frame).

As is standard in the auto-calibration literature for a single motion, the rotation
matrices can be eliminated to obtain equations relating only the unknowns. Applying
the rectifying transformation $H$ given by (3), the first 3x3 submatrix of the P matrix
in (2) gives

$$ K_i \, {}^\beta R_i = {}^\beta A_i K_1 + {}^\beta a_i \, {}^\beta v^\top, \qquad i = 2, \ldots, v \qquad (4) $$

which may be rearranged as

$$ K_i \, {}^\beta R_i = ( {}^\beta A_i - {}^\beta a_i \, {}^\beta p^\top ) K_1, \qquad i = 2, \ldots, v \qquad (5) $$

where ${}^\beta p = -K_1^{-\top} \, {}^\beta v$. The rotation $R_i$ may be eliminated on the right or the left, leaving

$$ K_i K_i^\top = ( {}^\beta A_i - {}^\beta a_i \, {}^\beta p^\top ) \, K_1 K_1^\top \, ( {}^\beta A_i - {}^\beta a_i \, {}^\beta p^\top )^\top \qquad \text{(DIAC)} $$

$$ (K_i K_i^\top)^{-1} = ( {}^\beta A_i - {}^\beta a_i \, {}^\beta p^\top )^{-\top} \, (K_1 K_1^\top)^{-1} \, ( {}^\beta A_i - {}^\beta a_i \, {}^\beta p^\top )^{-1} \qquad \text{(IAC)} \qquad (6) $$

which involve only the unknowns $K_i$ and ${}^\beta p$, and the known elements from the projective
reconstruction ${}^\beta A_i$, ${}^\beta a_i$. The first equation (DIAC) is alternatively obtained by multiplying
out the projection of the absolute (dual) quadric ED. 

It is at this point that we break with the conventional route of auto-calibration, because
in the multi-body case new equations can be formed. In particular, because $K_1$ is common
to all objects, each additional object generates five independent equations of the form
(e.g. from the DIAC equation)

$$ \omega_i^* = ( {}^\alpha A_i - {}^\alpha a_i \, {}^\alpha p^\top ) \, \omega_1^* \, ( {}^\alpha A_i - {}^\alpha a_i \, {}^\alpha p^\top )^\top = ( {}^\beta A_i - {}^\beta a_i \, {}^\beta p^\top ) \, \omega_1^* \, ( {}^\beta A_i - {}^\beta a_i \, {}^\beta p^\top )^\top \qquad (7) $$

for two objects $\alpha$ and $\beta$. In the case of two objects in $v$ views there are then $5(v - 1)$
independent equations on 11 unknowns, so the problem may be solved provided
$5v - 16 \geq 0$, consistent with the counting argument above. Or, stated another way, each
object provides 5 independent constraints for each view apart from the first, at a “start
up cost” of only 3 parameters per object (namely ${}^\beta p$). Alternatively, equations may be
obtained involving only the unknowns ${}^\beta p$. These are the analogue of the “Modulus
Constraint” [19].

If additional information is available, e.g. a fixed or skew zero camera, then additional
equations are generated from (6), in the same manner as conventional single body
auto-calibration.
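
The DIAC transfer equation in (6), and its agreement across two bodies as in (7), can be checked numerically. The following is a minimal sketch, assuming NumPy and made-up cameras, motions and plane-at-infinity vectors (all names are hypothetical). Euclidean cameras $K_i [R_i \mid t_i]$ are pushed into a projective frame by the inverse of the homography (3), and the DIAC of every view is then recovered from the projective cameras alone. No arbitrary rescaling of the projective cameras is applied here; with real data the relation holds only up to scale.

```python
import numpy as np

def calibration(f, u0, v0):
    return np.array([[f, 0.0, u0], [0.0, f, v0], [0.0, 0.0, 1.0]])

def rotation_y(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def projective_cameras(Ks, Rs, ts, v_inf):
    """Map Euclidean cameras K_i [R_i | t_i] into the frame where P_1 = [I | 0]."""
    H = np.eye(4)
    H[:3, :3] = Ks[0]
    H[3, :3] = v_inf
    H_inv = np.linalg.inv(H)
    return [K @ np.hstack([R, t.reshape(3, 1)]) @ H_inv
            for K, R, t in zip(Ks, Rs, ts)]

def diac_from_projective(P, K1, v_inf):
    """Right-hand side of the DIAC equation in (6) for one projective camera."""
    A, a = P[:, :3], P[:, 3]
    p = -np.linalg.solve(K1.T, v_inf)        # p = -K1^{-T} v
    M = A - np.outer(a, p)
    return M @ (K1 @ K1.T) @ M.T

# Internal parameters vary freely from view to view but are shared by both bodies.
Ks = [calibration(800.0, 320.0, 240.0),
      calibration(950.0, 310.0, 250.0),
      calibration(1100.0, 335.0, 228.0)]

motions = {
    "alpha": ([np.eye(3), rotation_y(0.15), rotation_y(0.30)],
              [np.zeros(3), np.array([0.2, 0.0, 0.1]), np.array([0.4, 0.1, 0.1])]),
    "beta": ([np.eye(3), rotation_y(-0.40), rotation_y(-0.75)],
             [np.zeros(3), np.array([-0.3, 0.2, 0.0]), np.array([-0.5, 0.3, 0.2])]),
}
planes = {"alpha": np.array([0.3, -0.2, 0.1]), "beta": np.array([-0.1, 0.4, 0.2])}

max_error = 0.0
for body, (Rs, ts) in motions.items():
    Ps = projective_cameras(Ks, Rs, ts, planes[body])
    for K, P in zip(Ks, Ps):
        diac = diac_from_projective(P, Ks[0], planes[body])
        truth = K @ K.T
        max_error = max(max_error, np.abs(diac - truth).max() / np.abs(truth).max())
print(max_error)
```

Both bodies reproduce the same $K_i K_i^\top$ in every view, which is the coupling that equation (7) exploits.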



2.2 Special Cases of Scenes and Camera Motions 

Multiple independently moving scene planes. It is assumed that each plane P induces
a homography H between each view. Triggs has shown that for a single moving
plane the homographies impose two constraints per view, after a “start up cost” of four
has been paid. Geometrically the two constraints per view arise from the imaged circular
points of the plane which lie on the IAC. These are transferred between views by H. The
start up cost arises because initially the imaged circular points are unknown, and their
four degrees of freedom must be specified.

Consequently, a fixed camera may be calibrated once v > (5 + 4)/2, i.e. 5 views. 
Now consider a camera in which two (or more) internal parameters vary between views. 
If two parameters vary, then the number of unknowns is 2v + 7 (the 7 results from three 
fixed camera parameters and the start up cost of 4 for the circular points). There are a 
total of 2v constraints, but this is never sufficient to determine the 2v + 7 unknowns. 
In the happy circumstance that 2 independently moving scene planes are available the 
accounting changes to four constraints per view plus a start up cost of eight, and the 
number of varying parameters may be increased to three (since the number of unknowns 
is 3v for the varying parameters, and 2 for the fixed parameters and together with the 
fixed cost this totals 10 + 3v, while the number of constraints is 4v). We have then for 
example. 

Result 3 (Calibration from multiple moving planes). Given two or more indepen- 
dently moving planes, a camera with varying zoom and principal point may be calibrated 
from image data alone. A minimum of 10 views are required. 

If the camera can be calibrated then the motion is also determined (up to at most a finite
ambiguity). If there are 3 planes, then all the internal parameters may vary provided
there are at least 12 views (in general we require $2bv \geq 5v + 4b$ for $b$ planes).
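
The accounting for moving planes can be reproduced with a few lines of bookkeeping. This is a sketch of the counting only, under the assumptions stated in the text: two constraints per plane per view, a start up cost of four per plane, and five internal parameters of which `varying` change in every view. The function name is illustrative.

```python
def min_views_planes(planes, varying, max_views=200):
    """Smallest v with 2*b*v >= varying*v + (5 - varying) + 4*b, else None."""
    for v in range(2, max_views + 1):
        constraints = 2 * planes * v
        unknowns = varying * v + (5 - varying) + 4 * planes
        if constraints >= unknowns:
            return v
    return None

print(min_views_planes(1, 0))   # fixed camera, one plane
print(min_views_planes(1, 2))   # one plane, two varying parameters
print(min_views_planes(2, 3))   # two planes, three varying parameters
print(min_views_planes(3, 5))   # three planes, all parameters varying
```

The four cases correspond to the statements above: 5 views for a fixed camera, no solution for one plane with two varying parameters, 10 views for Result 3, and 12 views for three planes with a fully varying camera.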

Multiple Planar motions. It is assumed that each object is undergoing planar motion
on a common plane, and that each object supports the computation of a consistent set 
of camera matrices. A common example would be a camera observing several cars at 
a road junction. In the case of a single object and fixed calibration Armstrong et al. 
showed that from 3 or more views the plane at infinity is uniquely determined, and the 
internal parameters are determined up to a one parameter family. 

If a scene contains two or more such degenerate motions, then each provides con- 
straints on the plane at infinity and camera calibration. However, the one parameter 
ambiguity is not removed unless constraints are provided by an object not undergoing 
planar motion. This quickly gives us several results of the form 

Result 4 (Disambiguation of planar motion). A camera undergoing planar motion 
can be uniquely calibrated if another moving object (which may be a special structure 
such as a plane) is viewed undergoing independent, but general, motion. 

2.3 Low-Relief Objects 

Often objects have low surface relief compared to their distance from the camera centre. 
If cameras are computed from such objects then poor estimates of the P matrices are 
obtained. 

The general situation then is that each independent motion produces a probability
density function over the space of possible 3D reconstructions. As shown in section 3,
each body $\beta$ is represented by a set of camera and scene parameters $\theta_\beta$ which maximizes
a likelihood function $P(\theta_\beta)$.

Typically these pdfs will not have a sharply-defined mode, but will instead define a
space of nearly equivalent reconstructions. A common example occurs when the camera
motion is towards a low-relief scene, and the camera focal length is unknown. Conflation
of focal length and distance to the scene means that focal length may vary over a wide
range and still give a reconstruction which explains the image data. However, when there
are multiple motions, their ambiguities will not in general coincide, so that combining
the estimates will reduce the ambiguity. Figure 2 illustrates this process, and section 3
describes the implementation.



Fig. 2. Combining pdfs of multiple independent motions under the constraint that they share the
same camera. Each motion produces a poorly constrained reconstruction, meaning that camera
parameters can be varied over a wide range (stroked ellipses) while generating small 2D
reprojection errors. However, because the ranges are different for each independent motion, the
multibody estimate (hatched ellipse) “intersects” the covariances to give an accurate result.
Plot labels: Camera parameters, Motion 1; Camera parameters, Multibody; Camera parameters, Motion 2.

2.4 Few Points

The final scenario explored here concerns the case when few points are available on
each moving object. Suppose we can observe $b$ bodies, each of $p$ points, in $v$ views. The
measurements are the $(x, y)$ position of the projection of each point in each view, so
the number of measurements is $2vbp$. The unknowns consist of 3 degrees of freedom
for each 3D point (to represent its $(x, y, z)$ in the camera’s coordinate frame), and 6
dof per view for each object to parametrize its Euclidean motion up to an unknown
similarity transformation, giving $3bp + b(6v - 7)$ dof in total. Consequently there are
$2vbp$ constraints on $3bp + b(6v - 7)$ unknowns, and provided

$$ 2vbp \geq \underbrace{b(3p + 6v - 7)}_{\text{Objects + Motions}} \qquad (8) $$

then there are constraints available for calibration. For 3 points per body, this reduces to
$6bv \geq b(6v + 2)$, which never holds: adding views or bodies provides only as many
measurements as degrees of freedom. However, for four points we obtain $2bv \geq 5b$, or $b(2v - 5)$
constraints on the camera parameters. In particular if $b = 1$, then there are $2v - 5$
constraints available for camera calibration. This means that after a “start up cost” of
five, there are two constraints per view available for calibration. Thus a fixed camera
can be determined after 5 views, and a camera with one parameter varying determined
once $2v - 5 \geq 4 + v$, i.e. nine views. Thus we have

Result 5 (Structure and motion from four points). Given four unknown rigidly 
coupled points undergoing Euclidean motion, observed by an unknown camera; it is 
possible to recover their structure, motion, and the camera calibration (even in the 
presence of a single varying intrinsic parameter), from image data alone. 

This result also applies in the case of four coplanar points since (as in section 2.2)
structure and motion can be recovered from four coplanar points.
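
The few-point counting can be verified in the same way. The sketch below evaluates the number of measurements minus the number of structure and motion unknowns, and then searches for the minimum number of views for a single body of four points. It is bookkeeping only, and the function names are illustrative.

```python
def spare_constraints(bodies, points, views):
    """Measurements 2vbp minus structure and motion unknowns 3bp + b(6v - 7)."""
    measurements = 2 * views * bodies * points
    unknowns = 3 * bodies * points + bodies * (6 * views - 7)
    return measurements - unknowns

def min_views_four_points(varying, max_views=100):
    """Views needed by one 4-point body when `varying` of the 5 intrinsics change per view."""
    for v in range(2, max_views + 1):
        if spare_constraints(1, 4, v) >= varying * v + (5 - varying):
            return v
    return None

print(spare_constraints(1, 3, 10))   # three points: always a deficit of 2 per body
print(spare_constraints(1, 4, 5))    # four points: 2v - 5
print(min_views_four_points(0))      # fixed camera
print(min_views_four_points(1))      # one varying parameter
```

Three points per body leave a constant deficit of two per body regardless of the number of views, while four points give $2v - 5$ spare constraints, which yields the 5-view and 9-view figures quoted above.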



3 Implementation 

The previous sections have provided the theoretical background to multibody recon- 
struction, without reference to implementation. In this section we briefly discuss these 
implementation issues, and in particular the cost function for bundle adjustment in the 
case of multibody motion, and the resulting covariances. 

3.1 Point Detection, Tracking, and Segmentation 

For this work, we employ standard point detection and tracking procedures, for example
as described in [25]: interest points are detected using the Harris operator [11], and
matched pairwise using a RANSAC procedure to compute the fundamental matrix,
followed by triplet matching using a RANSAC procedure to compute the trifocal tensor.
Suppose the sequence consists of a moving camera viewing a background scene and
an independently moving foreground object; then the RANSAC scoring mechanism
will first select the dominant motion (say this corresponds to the background motion).
Merging pairwise and triplet tracks gives multi-view 2D tracks for each 3D point of the
background. Tracks corresponding to this dominant motion are then removed, and the
procedure repeated to track the foreground object. This track removal and re-RANSAC
loop can be repeated until too few matches remain to reliably compute the between-view
relations.
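
The track removal and re-RANSAC loop can be sketched as follows. To keep the example short and dependency-free, a pure 2D image translation stands in for the fundamental matrix and trifocal tensor used in the paper, so this illustrates only the control flow (dominant motion first, remove its inliers, repeat), not the actual between-view relations. All names and thresholds are assumptions.

```python
import random

def ransac_translation(matches, rng, iterations=100, tol=1.0):
    """Indices of the largest set of matches consistent with one 2D translation."""
    best = []
    for _ in range(iterations):
        (x1, y1), (x2, y2) = rng.choice(matches)      # minimal sample: one match
        dx, dy = x2 - x1, y2 - y1
        inliers = [i for i, ((a1, b1), (a2, b2)) in enumerate(matches)
                   if (a2 - a1 - dx) ** 2 + (b2 - b1 - dy) ** 2 <= tol ** 2]
        if len(inliers) > len(best):
            best = inliers
    return best

def segment_motions(matches, min_inliers=4, seed=0):
    """Repeatedly extract the dominant motion and remove its matches."""
    rng = random.Random(seed)
    remaining = list(range(len(matches)))
    groups = []
    while len(remaining) >= min_inliers:
        subset = [matches[i] for i in remaining]
        inliers = ransac_translation(subset, rng)
        if len(inliers) < min_inliers:
            break                                      # too few matches to trust
        groups.append([remaining[i] for i in inliers])
        inlier_set = set(inliers)
        remaining = [idx for k, idx in enumerate(remaining) if k not in inlier_set]
    return groups

def shifted(points, dx, dy):
    return [((x, y), (x + dx, y + dy)) for x, y in points]

# Synthetic matches: 12 background, 6 foreground, 2 gross outliers.
background_pts = [(10.0 * i, (7.0 * i) % 50.0) for i in range(12)]
foreground_pts = [(200.0 + 3.0 * i, 100.0 + 5.0 * i) for i in range(6)]
matches = (shifted(background_pts, 5.0, 0.0)
           + shifted(foreground_pts, -3.0, 4.0)
           + [((50.0, 60.0), (90.0, 35.0)), ((70.0, 20.0), (53.0, 53.0))])

groups = segment_motions(matches)
print([len(g) for g in groups])
```

The first group returned is the dominant (background) motion and the second is the foreground object; the two outliers are left unassigned because too few matches remain.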

3.2 Bundle Adjustment 

A component of the implementation is the ability to quickly compute maximum likelihood
estimates of cameras and structure via bundle adjustment. The description of the
process is relatively simple: For $v$ views of $p$ 3D points, we wish to estimate camera
calibrations $K_v$, Euclidean transformations $[R_v \mid t_v]$ and 3D points $X_p$ which project
exactly to image points $\hat{x}_{vp} = P_v X_p = K_v [R_v \mid t_v] X_p$. The projection matrices and 3D
points which we seek are those that minimize the image distance between the reprojected
point and detected (measured) image points $x_{vp}$ for every view in which the 3D point
appears, i.e.

$$ \min_{K_v, R_v, t_v, X_p} \; \epsilon = \sum_{v \in \text{views}} \; \sum_{p \in \text{point indices}} d^2\big( x_{vp}, \, K_v [R_v \mid t_v] X_p \big) \qquad (9) $$

where $d(x, y)$ is the Euclidean image distance between the homogeneous 2D points $x$
and $y$. If the image error is Gaussian then bundle adjustment is the MLE. This error may
be efficiently minimized following [13].

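For multiple objects, the bookkeeping is a little more complex, but the principle
is analogous. The calibration matrices $K_v$ are common between objects, but each object has a
separate set of Euclidean transformations. Thus the error function for two objects has
the form

$$ \epsilon = \epsilon\big( K_{1..v}, \; {}^\alpha[R \mid t]_{1..v}, \; {}^\beta[R \mid t]_{1..v}, \; X_{1..p} \big) \qquad (10) $$

$$ \phantom{\epsilon} = \sum_{v \in \text{views}} \Big( \sum_{p \in \text{point indices } \alpha} d^2\big( x_{vp}, \, K_v [\, {}^\alpha R_v \mid {}^\alpha t_v \,] X_p \big) + \sum_{p \in \text{point indices } \beta} d^2\big( x_{vp}, \, K_v [\, {}^\beta R_v \mid {}^\beta t_v \,] X_p \big) \Big) $$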
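
The structure of the two-object error function (10) can be illustrated with a small evaluation routine. This is a minimal sketch assuming NumPy and synthetic, noise-free data; it only evaluates the cost and does not perform the minimization, which the paper carries out by bundle adjustment. The data layout (dictionaries keyed by body name) and all numerical values are assumptions made for illustration.

```python
import numpy as np

def rotation_y(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project(K, R, t, X):
    """Project the rows of X (N x 3) with camera K [R | t]; returns N x 2 pixels."""
    x = (K @ (R @ X.T + t.reshape(3, 1))).T
    return x[:, :2] / x[:, 2:3]

def multibody_cost(Ks, motions, structures, observations):
    """Sum of squared reprojection errors; Ks is shared by all bodies."""
    total = 0.0
    for body, X in structures.items():
        for view, K in enumerate(Ks):
            R, t = motions[body][view]
            residual = project(K, R, t, X) - observations[body][view]
            total += float(np.sum(residual ** 2))
    return total

rng = np.random.default_rng(0)
Ks = [np.array([[f, 0.0, 320.0], [0.0, f, 240.0], [0.0, 0.0, 1.0]])
      for f in (800.0, 850.0, 900.0)]
structures = {"alpha": rng.uniform(-1.0, 1.0, size=(8, 3)),
              "beta": rng.uniform(-1.0, 1.0, size=(6, 3))}
motions = {
    "alpha": [(rotation_y(0.1 * v), np.array([0.1 * v, 0.0, 6.0])) for v in range(3)],
    "beta": [(rotation_y(-0.3 * v), np.array([0.0, 0.2 * v, 8.0])) for v in range(3)],
}
observations = {body: [project(K, *motions[body][v], structures[body])
                       for v, K in enumerate(Ks)]
                for body in structures}

cost_at_truth = multibody_cost(Ks, motions, structures, observations)

# Perturbing one shared calibration matrix changes the residuals of both bodies.
Ks_perturbed = [K.copy() for K in Ks]
Ks_perturbed[1][0, 0] *= 1.05
Ks_perturbed[1][1, 1] *= 1.05
cost_perturbed = multibody_cost(Ks_perturbed, motions, structures, observations)
print(cost_at_truth, cost_perturbed)
```

Because every body is projected through the same $K_v$, an error in a calibration parameter is penalized by the measurements of all bodies at once.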



3.3 Covariance Intersection 

Where the motions in the scene provide reasonably well constrained camera parameters,
we wish to combine the parameter estimates in a way that makes use of the uncertainty
reduction illustrated in Figure 2. If the camera parameters are so constrained, an approximate
covariance matrix of each set of parameters at the static-scene estimate may be
computed from the Jacobian of the reprojection error and will provide a reasonable
model of the probability distribution $P(\theta)$ of the possible camera parameters.

From bundle adjustment, then, we compute the covariance estimate $\Lambda_\beta$ at the minimum
$\hat{\theta}_\beta$ for each body $\beta$, so that

$$ -\log p_\beta(\theta) \approx (\theta - \hat{\theta}_\beta)^\top \Lambda_\beta^{-1} (\theta - \hat{\theta}_\beta) $$

Then a combined estimate of the camera parameters is obtained m . For example, given
two objects with camera parameters $\theta_A$ and $\theta_B$, and covariances $\Lambda_A$ and $\Lambda_B$, the
combined parameters are given by

$$ \theta = \Lambda_B (\Lambda_A + \Lambda_B)^{-1} \theta_A + \Lambda_A (\Lambda_A + \Lambda_B)^{-1} \theta_B $$

In the multibody bundle adjustment this covariance intersection is achieved automatically.
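
The combination formula can be tried on a toy example. The sketch below assumes NumPy and two made-up estimates of a two-parameter camera (say focal length and one principal point coordinate), each well constrained in a different direction. The second function is the equivalent inverse-covariance weighted form, included only as a cross-check.

```python
import numpy as np

def combine_estimates(theta_a, cov_a, theta_b, cov_b):
    """theta = cov_b (cov_a + cov_b)^{-1} theta_a + cov_a (cov_a + cov_b)^{-1} theta_b."""
    sum_inv = np.linalg.inv(cov_a + cov_b)
    return cov_b @ sum_inv @ theta_a + cov_a @ sum_inv @ theta_b

def combine_information_form(theta_a, cov_a, theta_b, cov_b):
    """Same estimate written as an inverse-covariance weighted mean."""
    info_a, info_b = np.linalg.inv(cov_a), np.linalg.inv(cov_b)
    return np.linalg.solve(info_a + info_b, info_a @ theta_a + info_b @ theta_b)

# Body A constrains the first parameter well, body B the second.
theta_a, cov_a = np.array([1001.0, 350.0]), np.diag([1.0, 400.0])
theta_b, cov_b = np.array([960.0, 321.0]), np.diag([400.0, 1.0])

theta = combine_estimates(theta_a, cov_a, theta_b, cov_b)
print(theta)
```

The combined estimate stays close to body A in the first parameter and close to body B in the second, which is the “intersection” of the two uncertainty regions described above.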



4 Experiments 

We now test the theory on several synthetic and real examples. The experiments are 
designed to verify the theoretical results, and to determine the level of improvement 
provided over static-scene techniques. 




Fig. 3. Experiment 1 . Synthetic objects. The shaded plots (a) illustrate the shapes of the 3D objects. 
Data used in the experiments are just the 3D vertex coordinates and synthetic camera positions, 
typical examples of which are shown in the wireframe plots (b). 



4.1 High-Relief Bodies, Freely Varying Camera Parameters 

The first experiment is simply a test of the theory, in a case where static-scene recon- 
struction cannot compute Euclidean structure and motion. Two synthetic objects were 
created, a unit barrel and a “peaks” function, illustrated in Figure 3. A synthetic sequence
of 25 images was generated. In each frame, the 3D points of each object had an indepen- 
dent Euclidean transformation applied, with typical camera motions as illustrated. Also 
at each frame, all camera parameters were randomly chosen, so that no parameter was 
constant through the sequence. The 3D points were projected to generate 2D tracks. 

From the 2D tracks, a projective reconstruction $\{ {}^\beta P_v \}$ was computed, in an arbitrary
projective frame, for each body. In a static scene, such a projective reconstruction
is the best that can be achieved. In the multibody case, the DIAC constraint (7) provides
a way to compute a Euclidean upgrade. Given Euclideanizing homographies (3) for each
structure, we can compare the DIACs of (7) using the distance measure of HI:

The Euclideanizing homographies are parametrized in terms of $K_1$, $v_\alpha$, $v_\beta$, and a nonlinear
minimization of the above distance measure was implemented. Minimizations initialized
from various sites around the solution converged, as expected, to the true
solution. Note that in this experiment, a static-scene reconstruction converges to a different
(and erroneous) solution as the problem is underconstrained.

Fig. 4. Experiment 2. Office sequence. (a) Above: First frame of 90, with foreground object tracks
superimposed. Below: Last frame, background object tracks superimposed. (b) Results of focal
length estimation. Without multibody estimation, the foreground object has significant errors in
focal length.



4.2 Office Sequence 

Figure 4 shows a camcorder sequence of two objects: a static office background and
a rotating cake tin. The camera was moved through the scene, without zooming, but
with autofocus enabled. A projective reconstruction was computed for each object SI.
Then the reconstruction was used to initialize a bundle adjustment in which the focal
length was allowed to vary. This provides a measure of reconstruction accuracy, as the
recovered focal lengths should all be equal.

The graph shows the focal lengths computed through the sequence for (a) the background
alone, (b) the tin alone, and (c) both simultaneously. In this case, the background
structure has good relief, and the foreground object is quite shallow, so the combined
estimate is very similar to the background. The foreground object, on the other hand,
enjoys a greatly improved calibration estimate. With only single-body estimation there
are several equivalent reconstructions of the foreground points, corresponding to different
choices of focal lengths. Without some other knowledge, such as that provided by
multibody estimation, there is no way to choose the correct one.

Fig. 5. Experiment 3. Weak relief. (a) Frames from the synthetic sequence. (b) Recovered focal
lengths, plotted against frame number. (c) RMS difference between $\Delta f$ and nominal value. The
multibody estimate is closer to the ground truth and has lower variance than either of the
single-object results.



4.3 Weak Relief 

In this synthetic experiment, we consider a case where a Euclidean reconstruction is 
possible for each moving object, but the solution is poorly conditioned. In this case, we 
expect that the combined estimation will improve both the precision and veridicality of 
the recovered reconstruction. 

In this case, camera zoom was varied linearly, and the principal point was varied
smoothly in a curve. Object motions were as in section 4.1, but the objects were scaled
along one axis by a factor of 10 in order to reduce the relief. The 3D points were projected
onto a synthetic image plane of size 768 x 576, and the 2D positions perturbed by noise
vectors drawn from a 2D Gaussian distribution of mean zero and isotropic standard
deviation σ = 0.7 pixels. A conventional single-scene reconstruction was computed for
each object, finding the maximum likelihood estimate via bundle adjustment.
A multibody estimate was computed by bundle adjustment, using the averaged camera
parameters (section 3.3) as a starting value. The camera focal lengths were extracted, and are
shown in figure 5. As the focal length f is increasing linearly, the difference in f from
frame to frame should be constant. The RMS difference between the algorithms' reported
frame-to-frame difference in f and its nominal value was computed for each method, and the
results are tabulated in figure 5. The multibody solution is, as expected, better than either
of the static-scene estimates. (Simply averaging the focal lengths produces a worse RMS.) The
amount of improvement (RMS error reduces by a factor of 0.67) may appear small, but
it is consistent with the use of twice as much data, which would imply an improvement
of 1/√2 or 0.71.

[Figure 6 panels: Frame 0, Object A tracks; Frame 20, Object B tracks; graph of Height vs. Frame number]

Fig. 6. Experiment 4. Planes. Camera position fixed, object A is rotating, motion B arises due to
zoom. Motion segmentation is automatic. The graph plots the z coordinate of the computed optical
centre (in the coordinate frame where the windmill is the plane z = 0) against frame number,
which should be constant through the reconstruction. Because the multibody estimate gains more
accurate calibration parameters from the second motion, it has smaller standard deviation than the
static-scene reconstruction using object A alone.

4.4 Planes 

The fourth experiment uses a real sequence, some frames of which are shown in Figure 6.
The camera is stationary, but zooming. The scene motion is automatically segmented
into two components: the rotation of the windmill blades, and the homography
induced by the zooming camera. Initial estimates of the camera parameters are obtained
from the rotating plane. In this instance, not all points are tracked through all views,
so there is missing data, but the bundle adjustment is easily modified to cope with this
situation.

Figure 6 shows estimates of camera height (in windmill diameters) relative to the
plane of the windmill before and after incorporation of multibody tracks. The height
should be constant, so its standard deviation is a measure of the quality of the
reconstruction. On this sequence, multibody analysis improves the standard deviation from
0.42 to 0.19, again demonstrating the advantage of considering multiple motions.

5 Discussion and Open Questions 

We have extended the recovery of structure and motion to the case of sequences
comprising multiple motions. We have presented an extension to static-scene bundle adjustment,
which allows multiple motions to contribute to the estimation of camera parameters. We
have shown that multibody reconstruction allows the analysis of scenes which cannot be
handled by static-scene structure from motion, and that multibody analysis improves
the accuracy of reconstructions even when standard techniques do apply. The paper also
contributes the new result that structure and motion can be determined from just four
points in general position, even with one varying camera parameter, a result which can
find applications in tracking through long sequences.

Although we have demonstrated in principle that certain problems can be solved, 
the solutions have been presented as nonlinear minimizations. It remains of interest to 
elucidate more elegant closed-form solutions to these problems, in particular: 

1. Obtain the principal point and two focal lengths for the two view example of the
introduction, based on the Kruppa equations;

2. Determine a fixed calibration matrix and motion from four points imaged in suffi- 
ciently many views. 

3. Obtain affine camera specializations for the scenarios and results of sectional 
Abstract. The fundamental matrix defines a nonlinear 3D variety in
the joint image space of multiple projective (or “uncalibrated perspec- 
tive”) images. We show that, in the case of two images, this variety 
is a 4D cone whose vertex is the joint epipole (namely the 4D point 
obtained by stacking the two epipoles in the two images). Affine (or 
“para-perspective”) projection approximates this nonlinear variety with 
a linear subspace, both in two views and in multiple views. We also show 
that the tangent to the projective joint image at any point on that image 
is obtained by using local affine projection approximations around the 
corresponding 3D point. We use these observations to develop a new ap- 
proach for recovering multiview geometry by integrating multiple local 
affine joint images into the global projective joint image. Given multiple 
projective images, the tangents to the projective joint image are com- 
puted using local affine approximations for multiple image patches. The 
affine parameters from different patches are combined to obtain the epi- 
polar geometry of pairs of projective images. We describe two algorithms 
for this purpose, including one that directly recovers the image epipoles 
without recovering the fundamental matrix as an intermediate step. 



1 Introduction 

The fundamental matrix defines a nonlinear 3D variety¹ in the joint image space,
which is the 4-dimensional space of concatenated image coordinates of
corresponding points in two perspective images. Each 3D scene point X = (X, Y, Z)
induces a pair of matching image points (x, y, x', y') in the two images, and this
stacked vector of corresponding points is a point in the joint image space. The 
locus of all such points forms the joint image for the two cameras. Since there 
is a one-to-one correspondence between the 3D world and the joint image, the 
joint image forms a 3-dimensional variety in the joint image space. Every pair of 
cameras defines such a variety, which is parametrized by the fundamental matrix 
which relates the two cameras. 

The idea of the joint image space has been previously used by a few researchers,
notably by Triggs, who provided an extensive analysis of multi-view
matching constraints for projective cameras in the joint image space, and by
Shapiro, who analyzed the joint image of 2 affine (“para-perspective” projection)
camera images. Triggs also observed that for multiple (say m > 2) views
the “projective” joint image for any number of cameras is still a 3-dimensional
submanifold of the 2m dimensional joint image space. This manifold is
parametrized (up to a set of gauge invariances) by the m camera matrices.

¹ namely, a locus of points defined by a set of polynomial constraints, in this case the
epipolar constraint, which is quadratic in the joint image space.

Fig. 1. Illustration of the 4D cone in the joint image space. The two perspective images
view a 3D model of the house. The fundamental matrix forms a point-cone in the joint
image space, whose axes are defined by x, y, x' and y'. The vertex of the cone is the
joint epipole (namely the 4D point formed by stacking both epipoles). Each patch (i.e.,
the window or the door) in the 2D images is approximated by an affine projection
which corresponds to a tangent hyperplane in the joint image space.

Affine (or “para-perspective”) projection approximates the nonlinear variety 
with a linear subspace. In multiple views, the affine joint image is a 3-dimensional 
linear space in the 2m dimensional joint image space. This insight has led to 
the factorization approach, which simultaneously uses correspondence data from 
multiple views to optimally recover camera motion and scene structure mu 
from para-perspective images. 

The non-linearity of the projective joint image makes the multi-view
projective structure from motion problem harder to solve. A natural question is
whether the affine approximation could be used to benefit the projective case.
This question has been previously explored by a few researchers. For example,
some use the affine model globally over the entire image to bootstrap the
projective recovery. On the other hand, Lawn and Cipolla use affine
approximations over local regions of an image. They combine the affine parallax
displacement vectors across two frames from multiple such regions in order to
obtain the perspective epipole.

In this paper, we use the joint image space to show the intimate relationship 
between the two models and exploit it to develop a new algorithm for the recovery 
of the fundamental matrix and the epipoles. We establish the following results: 

1. The joint image of two projective views of a 3D scene is a point cone [12] in
the 4-dimensional joint image space. See Figure 1.

2. The tangent to the projective joint image is the same as the linear space 
formed by the joint image obtained by using an affine projection approxi- 
mation around the corresponding 3D scene point. This is true both in 2 and 
in multiple views. 
The process of recovering the fundamental matrix is thus equivalent to fitting
a 4D cone. For example, the 8-point algorithm [5] for recovering the fundamental
matrix can be viewed as fitting a 4D cone to 8 points in the joint image space,
since every pair of matching points in the 2D images gives rise to a single data 
point in the joint image space. Any technique for recovering the fundamental 
matrix can be regarded this way. 

Alternatively, the projective joint image can also be recovered by using the 
tangents to it at different points. In our case, a tangent corresponds to using a 
local para-perspective (or “affine” ) projection approximation for an image patch 
around the corresponding 3D scene point. This leads to a practical two-stage 
algorithm for recovering the fundamental matrix, which is a global projective 
constraint, using multiple local affine constraints. 

1. The first stage of our algorithm simultaneously uses multiple (m > 2) images 
to recover the 3-dimensional affine tangent image in the 2m dimensional 
joint image space. This can be done by using a factorization or “direct” type 
method. 

2. In the second stage the two- view epipolar geometry between a reference 
image and each of the other images is independently recovered. This is done 
by fitting the 4D cone to the tangents recovered in Stage I from multiple 
image patches. We take advantage of the fact that all the tangents to the 
cone intersect at its vertex - the joint epipole - to compute it directly from 
the tangents. Thus, local affine measurements are used to directly estimate 
the epipoles without recovering the fundamental matrix as an intermediate 
step. 

It is worth noting that this approach to directly recover the epipoles is a
generalization of the aforementioned work by Lawn & Cipolla, as well as an
algorithm by Rieger & Lawton for computing the focus of expansion for a
moving image sequence from parallax motion around depth discontinuities. We
postpone more detailed comparison and contrast of our work with these previous
papers to a later section, since we believe that a clear understanding of our method
will be useful in appreciating these relationships.



2 The Affine and Projective Joint Images

This section establishes the tangency relationship between projective and affine 
joint images and shows that the projective joint image of two images is a 4D 
cone. Our derivations proceed in 3 stages: (i) We start by showing that the 
affine projection is a linear approximation to the projective case. We use the 
affine projection equations to derive the affine motion equations in 2 frames and 
the associated affine fundamental matrix. These results are already known (e.g.,
see lam) but they serve to lay the ground for the remaining derivations in the
paper. (ii) Next we show that for two (uncalibrated) perspective views the joint
image is a 4D cone. (iii) Finally we show that the hyperplane described by the
affine fundamental matrix is tangent to this 4D cone. 
2.1 The Projective Joint Image

We use the following notational conventions. $\mathbf{x} = (x, y)^T$ denotes a 2D image
point, while $\mathbf{p} = (x, y, 1)^T$ denotes the same point in homogeneous coordinates.
Likewise $\mathbf{X} = (X, Y, Z)^T$ denotes a 3D scene point, and $\mathbf{P} = (X, Y, Z, 1)^T$
denotes its homogeneous counterpart. The general uncalibrated projective camera is
represented by the projection equation:

$$\mathbf{p} \cong M \mathbf{P} = H \mathbf{X} + \mathbf{t},$$

where M denotes the 3x4 projection matrix, H (referred to as the “homography”
matrix) is the left 3x3 submatrix of M, and the 3x1 vector t is its last column,
which represents the translation between the camera and the world coordinate
systems.

Since our formulation of the joint-image space involves stacking the
inhomogeneous coordinates $\mathbf{x}$ from multiple views into a single long vector, it
is more convenient to describe the projection equations in inhomogeneous
coordinates². The projection of a 3D point $\mathbf{X}$ onto the 2D image point $\mathbf{x}^i$ in the
$i$-th image is given by:

$$\mathbf{x}^i = \begin{pmatrix} x^i \\ y^i \end{pmatrix} = \begin{pmatrix} \dfrac{H_1^i \mathbf{X} + t_1^i}{H_3^i \mathbf{X} + t_3^i} \\[2ex] \dfrac{H_2^i \mathbf{X} + t_2^i}{H_3^i \mathbf{X} + t_3^i} \end{pmatrix} \qquad (1)$$

where $H_1^i$, $H_2^i$ and $H_3^i$ are the three rows of $H^i$, the homography matrix of the
$i$-th image. Likewise $(t_1^i, t_2^i, t_3^i)$ denote the three components of $\mathbf{t}^i$, the translation
for the $i$-th image.

Consider the stacked vector of image coordinates from the m images, namely
the 2m dimensional joint-image vector $(x^1\; y^1\; x^2\; y^2 \ldots x^m\; y^m)^T$. We see from
Equation 1 that each component of this vector is a non-linear function of the 3D
position vector X of a scene point. Hence, the locus of all such points forms a
3-dimensional submanifold in the 2m dimensional space. This defines the projective
joint image. (We have chosen to call it “projective” only to indicate that the joint
image of multiple perspectively projected views of a 3D scene does not require any
knowledge or assumptions regarding calibration.)



2.2 The Affine Joint Image



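Around some point $\mathbf{X}_0$ on the object we can rewrite the $x$-component of
Equation 1 as

$$x = \frac{H_1 \mathbf{X}_0 + t_1 + H_1 \Delta\mathbf{X}}{H_3 \mathbf{X}_0 + t_3 + H_3 \Delta\mathbf{X}} \qquad (2)$$

where $\Delta\mathbf{X} = \mathbf{X} - \mathbf{X}_0$. (Note that we have dropped the superscript $i$ for ease of
readability.) Let us denote $Z_0 = H_3 \mathbf{X}_0 + t_3$ and $\Delta Z = H_3 \Delta\mathbf{X}$. We divide the
numerator and denominator by $Z_0$ to obtain

$$x = \frac{\dfrac{H_1 \mathbf{X}_0 + t_1}{Z_0} + \dfrac{H_1 \Delta\mathbf{X}}{Z_0}}{1 + \dfrac{\Delta Z}{Z_0}} \qquad (3)$$

² In this regard, our formulation is slightly different from the more general projective
space treatment of HSI. Our approach turns out to be more convenient for deriving
the affine approximations.

Considering a shallow portion of the 3D scene around $\mathbf{X}_0$, i.e., $\epsilon = \Delta Z / Z_0 \ll 1$,
we can use the approximation $1/(1 + \epsilon) \approx 1 - \epsilon$ to obtain

$$x \approx \left( \frac{H_1 \mathbf{X}_0 + t_1}{Z_0} + \frac{H_1 \Delta\mathbf{X}}{Z_0} \right) \left( 1 - \frac{\Delta Z}{Z_0} \right) \qquad (4)$$

Expanding this equation, replacing $\Delta Z$ with $H_3 \Delta\mathbf{X}$ and neglecting second-order
terms in $\Delta\mathbf{X}$ gives us the first-order Taylor expansion for the perspective
projection equation for $x$:

$$x \approx x_0 + \frac{H_1 \Delta\mathbf{X}}{Z_0} - x_0 \frac{H_3 \Delta\mathbf{X}}{Z_0} \qquad (5)$$

Performing a similar derivation for $y$ we get the affine projection equations
that relate a 3D point to its 2D projection:

$$\begin{pmatrix} x \\ y \end{pmatrix} \approx \begin{pmatrix} x_0 \\ y_0 \end{pmatrix} + \frac{1}{Z_0} \begin{pmatrix} H_1 - x_0 H_3 \\ H_2 - y_0 H_3 \end{pmatrix} \Delta\mathbf{X} \qquad (6)$$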



Since these equations are linear in $\Delta\mathbf{X}$, they define a 3-dimensional linear variety
in the 2m dimensional joint-image space. Stacking all such equations for the m
views gives us the parametric representation of the linear affine joint-image.
Also, in each of the m images these equations represent the first-order Taylor
expansion of the perspective projection equations around the image point $\mathbf{x}_0$.
This means that the 2m dimensional affine joint-image is the tangent to the
projective joint image at the point represented by the 2m dimensional vector
$(x_0^1\; y_0^1\; x_0^2\; y_0^2 \ldots x_0^m\; y_0^m)^T$.
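
The tangency claim can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes NumPy, and the camera values `H`, `t`, the point `X0` and the direction `d` are hypothetical numbers chosen only for illustration. It compares the perspective projection of Equation 1 with the affine approximation of Equation 6 and measures how the approximation error shrinks when the offset from `X0` is halved.

```python
import numpy as np

def project(H, t, X):
    """Perspective projection (Equation 1): inhomogeneous image point of the 3D point X."""
    h = H @ X + t
    return h[:2] / h[2]

def affine_project(H, t, X0, X):
    """First-order (affine) approximation around X0 (Equation 6)."""
    Z0 = H[2] @ X0 + t[2]                      # Z0 = H3 X0 + t3
    x0 = project(H, t, X0)                     # image of the expansion point
    J = (H[:2] - np.outer(x0, H[2])) / Z0      # rows (H1 - x0 H3)/Z0 and (H2 - y0 H3)/Z0
    return x0 + J @ (X - X0)

# Hypothetical camera and scene values, for illustration only.
H = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
t = np.array([10.0, -5.0, 2.0])
X0 = np.array([0.5, -0.3, 8.0])
d = np.array([0.3, 0.2, 0.5])                  # direction of the offset from X0

def approximation_error(s):
    """Distance in pixels between exact and affine projections of X0 + s*d."""
    X = X0 + s * d
    return np.linalg.norm(project(H, t, X) - affine_project(H, t, X0, X))

err_full = approximation_error(0.1)
err_half = approximation_error(0.05)
ratio = err_full / err_half                    # close to 4 for a first-order approximation
```

Halving the offset reduces the error by roughly a factor of four, which is the behaviour expected when the affine model agrees with the perspective model up to first order.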



2.3 Two View Affine Motion Equations

Next, we want to derive the affine motion equations that relate matching points 
across two images. Such equations have also been previously used by a number of 
researchers (e.g., see |l 119) 1. We present them here in terms of the homography 
matrix H and a matching pair of image points $(x_0, y_0)$ and $(x'_0, y'_0)$ in two views,
around which the local affine approximation to perspective projection is taken.

Let us align the world coordinate system with that of the first camera so its
homography becomes the 3x3 identity matrix and its translation is a zero 3-vector.
Let H, t be the homography matrix and translation, respectively, of the second
camera. We will not present the entire derivation here, but simply note that the
key step is eliminating $\Delta X$ and $\Delta Y$ from Equation 6. Let $\gamma = \Delta Z / Z_0$ be the
“relative” depth of a point $(x, y)$ in the reference image. Then (after some algebraic
manipulation) we can show that the corresponding point $(x', y')$ in the second
image is given by

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = A \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} + \gamma\, \mathbf{t}_a \qquad (7)$$

where A is a 2 x 3 affine matrix

$$A = \left( G \mid \mathbf{x}'_0 - G \mathbf{x}_0 \right) \qquad (8)$$

G is a 2 x 2 matrix

$$G = \frac{Z_0}{Z'_0} \begin{pmatrix} H_{11} - x'_0 H_{31} & H_{12} - x'_0 H_{32} \\ H_{21} - y'_0 H_{31} & H_{22} - y'_0 H_{32} \end{pmatrix} \qquad (9)$$

and

$$\mathbf{t}_a = \frac{Z_0}{Z'_0} \begin{pmatrix} (H_1 - x'_0 H_3)\, \mathbf{p}_0 \\ (H_2 - y'_0 H_3)\, \mathbf{p}_0 \end{pmatrix} \qquad (10)$$



Equation 7 can be interpreted as follows. A defines a 2D affine transformation
of the image which captures the motion of points for which $\Delta Z = 0$, i.e., they
are on a “frontal” plane at depth $Z_0$ in the first image. Off-plane points undergo
an additional parallax motion $\gamma \mathbf{t}_a$. The parallax magnitude is determined by
the relative depth $\gamma = \Delta Z / Z_0$. The direction of parallax is $\mathbf{t}_a$, and is the same for
all points, i.e., the “affine epipole” is at infinity³.

We observe that Equation 7 (affine motion) is valid for any view relative to
a reference view. The 2D affine transformation matrix A and the vector $\mathbf{t}_a$ vary
across the images while the local shape $\gamma$ varies across all points, but is fixed
for any given point across all the images. The fact that $\gamma$ is constant over
multiple views enables us to simultaneously recover all the 2D affine transformation
matrices A and the affine parallax vectors $\mathbf{t}_a$ (see Section 3).



The two-view affine epipolar constraint: The affine motion equations
defined in Equation 7 also imply an affine epipolar constraint El:

$$\mathbf{p}'^T F_a\, \mathbf{p} = 0,$$

where:

$$F_a = \begin{pmatrix} 0 & 0 & t_2 \\ 0 & 0 & -t_1 \\ -t_2 & t_1 & 0 \end{pmatrix} \begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ 0 & 0 & 1 \end{pmatrix} \qquad (11)$$

is the “affine” fundamental matrix, with $t_1, t_2$ the two components of $\mathbf{t}_a$. Note that
this matrix is of the form

$$\begin{pmatrix} 0 & 0 & f_3 \\ 0 & 0 & f_4 \\ f_1 & f_2 & f_5 \end{pmatrix}.$$

Let us denote $\mathbf{f} = (f_1, \ldots, f_5)^T = (-t_2 A_{11} + t_1 A_{21},\; -t_2 A_{12} + t_1 A_{22},\; t_2,\; -t_1,\; -t_2 A_{13} + t_1 A_{23})^T$.
Also let $\mathbf{q} = (x, y, x', y')^T$ be the stacked vector of the two pairs of image
coordinates in the 4-dimensional joint image space. The affine epipolar constraint
says that the affine joint image consists of all points q which lie on a hyperplane
given by the equation

$$\mathbf{f}^T \begin{pmatrix} \mathbf{q} \\ 1 \end{pmatrix} = 0 \qquad (12)$$

³ It can also be shown that this vector lies along the direction of the line connecting
the point $\mathbf{p}_0$ to the epipole defined by t in the second perspective image.

This implicit form (as opposed to the parametric form described earlier)
will be useful later for us (see Section 2.5) to relate to the perspective fundamental
matrix⁴. (Of course, this approximation is only reasonable for points p and p'
which lie near the matching pair of points $\mathbf{p}_0$ and $\mathbf{p}'_0$.)



2.4 The Fundamental Matrix as a Point-Cone 



Given two views, the well-known epipolar constraint equation can be written in
our notation as:

$$\mathbf{p}'^T F \mathbf{p} = 0 \qquad (13)$$

where p and p' denote the 2D image location of a scene point in two views
specified in homogeneous coordinates, and F is the 3x3 fundamental matrix. In
the joint image space, this equation can be written as:

$$\frac{1}{2} (\mathbf{q}^T\; 1)\, C \begin{pmatrix} \mathbf{q} \\ 1 \end{pmatrix} = 0 \qquad (14)$$

where as before, $\mathbf{q} = (x, y, x', y')^T$ is the stacked vector of the image coordinates
of a matching point, and C is the 5x5 matrix defined below.

$$C = \begin{pmatrix} 0 & 0 & F_{11} & F_{21} & F_{31} \\ 0 & 0 & F_{12} & F_{22} & F_{32} \\ F_{11} & F_{12} & 0 & 0 & F_{13} \\ F_{21} & F_{22} & 0 & 0 & F_{23} \\ F_{31} & F_{32} & F_{13} & F_{23} & 2F_{33} \end{pmatrix} \qquad (15)$$



This equation describes a quadric in the 4 dimensional joint image space of
(x, y, x', y'). We now analyze the shape of this quadric.

Theorem: The joint-image corresponding to two uncalibrated perspective views
of a 3D scene is a point cone.

Proof: First, we show that the rank of the 5x5 matrix C is 4. To do this, we
rewrite C as the sum of two matrices $C_1$ and $C_2$, where

$$C_1 = \begin{pmatrix} 0 & 0 & F_1^T \\ 0 & 0 & F_2^T \\ 0 & 0 & \mathbf{0}^T \\ 0 & 0 & \mathbf{0}^T \\ 0 & 0 & F_3^T \end{pmatrix} \qquad (16)$$

and

$$C_2 = C_1^T$$

where $F_i^T$ denotes the $i$-th row of $F^T$ (equivalently, $F_i$ is the $i$-th column of
F) and 0 is a 3D zero vector. Now $C_1$ and $C_2$ are both of rank 2, since the 3x3
submatrices contained in them are in fact the fundamental matrix F and $F^T$,
which are of rank 2. Also, it can be easily verified that $C_1$ and $C_2$ are linearly
independent of each other. Hence $C = C_1 + C_2$ is of rank 4.

⁴ The parameters $(f_1, \ldots, f_5)$ can be derived in terms of H, $\mathbf{p}_0$ and $\mathbf{p}'_0$. However, these
expressions are somewhat tedious and do not shed any significant insight into the
problem. Hence, we have not elaborated them here.

According to [12], a quadric defined by a 5 x 5 matrix C which is of rank 4
represents a 4D cone called a point cone, which is simply a term to describe a
4-dimensional cone. Since the projective joint image is defined by our C which
is of rank 4, the joint image has the shape of a point cone. QED

Let $\mathbf{e} = (e_1\; e_2\; 1)^T$ and $\mathbf{e}' = (e'_1\; e'_2\; 1)^T$ denote the two epipoles in the two
images in homogeneous coordinates. Let us define the point $\mathbf{q}_e = (e_1\; e_2\; e'_1\; e'_2\; 1)^T$
in the joint image space as the “joint epipole”.

Corollary: The vertex of the projective joint image point cone is the joint
epipole.

Proof: From the definition of C it is easy to verify that the entries of $C \mathbf{q}_e$ are
the entries of $F\mathbf{e}$ and $F^T\mathbf{e}'$ (the last entry being the sum of their third components).
But since the epipoles are the null-vectors of the F matrix, we know that
$F\mathbf{e} = F^T\mathbf{e}' = 0$. Hence, $C\mathbf{q}_e = 0$. This means that the joint epipole is the null-vector
for C, and once again according to [12], this means that the point $\mathbf{q}_e$ which
denotes the joint epipole is the vertex of the point cone. QED
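
The theorem and corollary can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it uses NumPy, the rank-2 matrix `F` is a hypothetical example constructed so that its epipoles are (4, -1) and (2, 3), and the helper names are invented for this example. It builds C as $C_1 + C_1^T$ and verifies that C has rank 4 and that the joint epipole is its null vector.

```python
import numpy as np

def joint_image_quadric(F):
    """Build the 5x5 matrix C with (q,1)^T C (q,1) = 2 p'^T F p, for q = (x, y, x', y')."""
    S = np.zeros((3, 5))
    S[0, 0] = S[1, 1] = S[2, 4] = 1.0       # p  = S  @ (x, y, x', y', 1)
    Sp = np.zeros((3, 5))
    Sp[0, 2] = Sp[1, 3] = Sp[2, 4] = 1.0    # p' = Sp @ (x, y, x', y', 1)
    C1 = S.T @ F.T @ Sp                     # rows (0 0 F_i^T), as in the decomposition above
    return C1 + C1.T                        # C = C1 + C2 with C2 = C1^T

def null_vector(M):
    """Right singular vector belonging to the smallest singular value of M."""
    return np.linalg.svd(M)[2][-1]

# Hypothetical rank-2 fundamental matrix with F e = 0 and F^T e' = 0
# for e = (4, -1, 1) and e' = (2, 3, 1).
F = np.array([[1.0, 2.0, -2.0],
              [3.0, 5.0, -7.0],
              [-11.0, -19.0, 25.0]])

e = null_vector(F)
e = e / e[2]                                # epipole in the first image
ep = null_vector(F.T)
ep = ep / ep[2]                             # epipole in the second image
q_e = np.array([e[0], e[1], ep[0], ep[1], 1.0])

C = joint_image_quadric(F)
rank_C = np.linalg.matrix_rank(C)           # rank of the quadric
vertex_residual = np.linalg.norm(C @ q_e)   # near zero if q_e is the vertex

# A matching pair p = (1, 1, 1), p' = (0, 5, 1) satisfies p'^T F p = 0,
# so the stacked point lies on the cone.
q0 = np.array([1.0, 1.0, 0.0, 5.0, 1.0])
on_cone = q0 @ C @ q0
normal = C @ q0                             # normal of the tangent hyperplane at q0
tangent_contains_vertex = normal @ q_e      # near zero: the tangent passes through q_e
```

The last two quantities illustrate that the tangent hyperplane at a point on the cone passes through the joint epipole, which is the property that the direct epipole estimation relies on.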

2.5 The Tangent Space of the Projective Joint Image 

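As per Equation 14, in the case of 2 views, the projective joint image variety is
a level set of the function

$$f(\mathbf{q}) = \frac{1}{2} (\mathbf{q}^T\; 1)\, C \begin{pmatrix} \mathbf{q} \\ 1 \end{pmatrix}$$

corresponding to level zero. Hence,

$$\nabla f = C \begin{pmatrix} \mathbf{q} \\ 1 \end{pmatrix} \qquad (17)$$

defines its orientation (or the “normal vector”) at any point q. Considering a
specific point $\mathbf{q}_0$, let $\mathbf{p}_0 = (x_0, y_0, 1)^T$ and $\mathbf{p}'_0 = (x'_0, y'_0, 1)^T$ be the corresponding
image points in the two views (in homogeneous coordinates). By looking into the
components of C, it can be shown that:

$$\Pi_{\mathbf{q}_0} = \nabla f_{\mathbf{q}_0} = C \begin{pmatrix} \mathbf{q}_0 \\ 1 \end{pmatrix} = \begin{pmatrix} (F^T \mathbf{p}'_0)_1 \\ (F^T \mathbf{p}'_0)_2 \\ (F \mathbf{p}_0)_1 \\ (F \mathbf{p}_0)_2 \\ (F \mathbf{p}_0)_3 + (F^T \mathbf{p}'_0)_3 \end{pmatrix} \qquad (18)$$

where $(\mathbf{v})_i$ denotes the $i$-th component of the vector v. We have denoted the normal
by $\Pi_{\mathbf{q}_0}$. The equation of the tangent hyperplane at $\mathbf{q}_0$ is:

$$\Pi_{\mathbf{q}_0}^T \begin{pmatrix} \mathbf{q} \\ 1 \end{pmatrix} = 0 \qquad (19)$$

In other words, all points q in the joint-image space that satisfy the above
equation lie on the tangent hyperplane. Note that the joint epipole $\mathbf{q}_e$ lies on
the tangent plane⁵ since

$$\Pi_{\mathbf{q}_0}^T \mathbf{q}_e = \left( C \begin{pmatrix} \mathbf{q}_0 \\ 1 \end{pmatrix} \right)^T \mathbf{q}_e = (\mathbf{q}_0^T\; 1)\, C\, \mathbf{q}_e = 0 \qquad (20)$$

since C is symmetric. We already showed in Section 2.2 that the tangent
hyperplane to the projective joint image is given by the local affine joint image. Hence,
the components of $\Pi_{\mathbf{q}_0}$ given above must be the same (up to scale) as $(f_1, \ldots, f_5)$
defined in Section 2.3, which are estimated from the local affine patches. This fact will be
useful in our algorithm described in Section 3.3.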

3 Algorithm Outline 

We use the fact that the local affine projection approximation gives the tangent 
to the 4D cone for recovering the epipolar geometry between a reference view 
and all other views. Our overall algorithm consists of two stages: 

1. Estimate the affine projection approximation for multiple local patches in the
images. For each patch, use all the images to estimate the affine projection
parameters. This is equivalent to computing the linear 3D subspace of the
multiview joint image in the 2m-dimensional joint image space of the m
input images. This is described in Section 3.1.

2. Determine the epipolar geometry between each view and the reference view
by integrating the tangent information computed for different local affine
patches in Step 1 above. This is described in Sections 3.2 and 3.3.

We actually present two different methods for Step 2. The first method (see Section 3.2)
samples the joint image around the location of different affine tangents in the
4-dimensional joint image space to obtain a dense set of two-frame
correspondences, and then applies the standard 8-point algorithm to recover the fundamental
matrix between each view and the reference view.

The second method, described in Section 3.3, is more novel and interesting and uses
the fact that all tangent planes to the cone must pass through its vertex (i.e.,
the joint epipole) to directly recover the epipole between each view and the
reference view without computing the fundamental matrix as an intermediate
step.



3.1 Local Affine Estimation

Any algorithm for estimating affine projection can be used. For example, each
affine patch could be analyzed using the factorization method. However, within
a small patch it is usually difficult to find a significant number of features.
Hence, we use the “direct multi-frame” estimation by |Sj that attempts to use all
available brightness variations within the patch. The algorithm takes as input three
or more images and computes the shape and motion parameters to minimize a
brightness error. Specifically, let $\mathbf{a}_j, \mathbf{t}_j$ be the affine motion parameters from
the reference image to image $j$. (Each $\mathbf{a}_j$ is a 6-vector corresponding to the 6
elements of the 2D affine matrix in Equation 8, and $\mathbf{t}_j$ corresponds to the vector
$\mathbf{t}_a$ in Equation 10.) Also, let $\gamma_i$ be the “relative depth” of pixel $i$ in the reference
image. We minimize the following error function:

$$E(\{\mathbf{a}_j\}, \{\mathbf{t}_j\}, \{\gamma_i\}) = \sum_j \sum_i \left( \nabla I_i^T \mathbf{u}_i^j + I_{t,i}^j \right)^2,$$

where $\nabla I_i$ is the gradient of pixel $i$ at the reference image and $I_{t,i}^j$ is the temporal
intensity difference of the pixel between frame $j$ and the reference frame 0, and

$$\mathbf{u}_i^j = \mathbf{x}_i^j - \mathbf{x}_i^0 = Y_i \mathbf{a}_j + \mathbf{t}_j \gamma_i,$$

is the displacement of the pixel $i$ between frame 0 (the reference frame) and
frame $j$. The matrix Y has the form:

$$Y = \begin{pmatrix} x & y & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & x & y & 1 \end{pmatrix} \qquad (21)$$

⁵ This is not surprising, since every tangent plane to a cone passes through its vertex,
which in this case is the joint epipole.

Taking derivatives with respect to $\mathbf{a}_j$, $\mathbf{t}_j$ and $\gamma_i$ and setting to zero we get:

$$\begin{aligned} 0 &= \sum_i Y_i^T \nabla I_i \left( \nabla I_i^T (Y_i \mathbf{a}_j + \mathbf{t}_j \gamma_i) + I_{t,i}^j \right) \\ 0 &= \sum_i \gamma_i \nabla I_i \left( \nabla I_i^T (Y_i \mathbf{a}_j + \mathbf{t}_j \gamma_i) + I_{t,i}^j \right) \\ 0 &= \sum_j \mathbf{t}_j^T \nabla I_i \left( \nabla I_i^T (Y_i \mathbf{a}_j + \mathbf{t}_j \gamma_i) + I_{t,i}^j \right) \end{aligned} \qquad (22)$$

Because $\mathbf{t}_j$ and $\gamma_i$ are coupled in $\nabla E$ we take a back-and-forth approach to
minimizing E. At each step we fix $\mathbf{a}_j, \mathbf{t}_j$ and compute $\gamma_i$ as:

$$\gamma_i = - \frac{\sum_j \mathbf{t}_j^T \nabla I_i \left( \nabla I_i^T Y_i \mathbf{a}_j + I_{t,i}^j \right)}{\sum_j \left( \mathbf{t}_j^T \nabla I_i \right)^2} \qquad (23)$$

and minimize the new error function

$$E = \sum_j \sum_i \left( \nabla I_i^T (Y_i \mathbf{a}_j + \mathbf{t}_j \gamma_i) + I_{t,i}^j \right)^2 \qquad (24)$$

The new parameters $\mathbf{a}_j, \mathbf{t}_j$ are used to recalculate $\gamma_i$ and so on. This entire
process is applied within the usual coarse-to-fine estimation framework using
Gaussian pyramids of the images.



3.2 From Local Affine Approximation to the Global Fundamental
Matrix

In this method, we use the affine motion parameters of all the patches between
the reference and a given image to recover the projective fundamental matrix
between these two images. For a particular image and particular patch we
consider the affine motion parameters a, t, as recovered in the previous subsection,
to calculate the tangent plane. For increased numerical stability, we uniformly
sample points on a 3D “slab” tangent to the cone. This is done using the
equation:

$$\mathbf{p}' = \mathbf{a}\, \mathbf{p} + \gamma\, \mathbf{t} \qquad (25)$$

where $\mathbf{p} = (x, y, 1)^T$, with x, y uniformly sampled within the patch, and $\gamma$ takes
the values -1, 0, 1. This sampling procedure has the effect of “hallucinating”
matching points between the two images, based on the affine parameters of
the patch. These hallucinated matching points correspond to virtual 3D points
within a shallow 3D slab of the 3D scene. This is repeated for every patch. We
use Hartley’s normalization for improved numerical stability and compute the
fundamental matrix, from the hallucinated points, using the 8-point algorithm.
Note that since the matching points are hallucinated, there is no need to use
robust estimation techniques such as LMedS or RANSAC.
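
The sampling step of Equation 25 can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: it assumes NumPy, the function names are invented for this example, the affine parameters `A` and `t` are hypothetical values, and `A` is assumed to be expressed in global image coordinates. The check at the end uses the affine epipolar hyperplane of Equation 12, which every hallucinated match from a single patch should satisfy.

```python
import numpy as np

def hallucinate_matches(A, t, x_range, y_range, n=5, gammas=(-1.0, 0.0, 1.0)):
    """Sample matches (x, y, x', y') on the tangent slab of one patch: p' = A p + gamma t."""
    xs = np.linspace(x_range[0], x_range[1], n)
    ys = np.linspace(y_range[0], y_range[1], n)
    matches = []
    for x in xs:
        for y in ys:
            p = np.array([x, y, 1.0])
            for g in gammas:
                xp, yp = A @ p + g * t          # Equation 25
                matches.append((x, y, xp, yp))
    return np.array(matches)

def affine_hyperplane(A, t):
    """Coefficients (f1, ..., f5) of the affine epipolar hyperplane f^T (q, 1) = 0."""
    return np.array([t[0] * A[1, 0] - t[1] * A[0, 0],
                     t[0] * A[1, 1] - t[1] * A[0, 1],
                     t[1],
                     -t[0],
                     t[0] * A[1, 2] - t[1] * A[0, 2]])

# Hypothetical affine motion parameters of a single patch.
A = np.array([[1.02, 0.01, 3.0],
              [-0.02, 0.98, -1.5]])
t = np.array([2.0, 0.5])

matches = hallucinate_matches(A, t, x_range=(100.0, 150.0), y_range=(80.0, 120.0))
f = affine_hyperplane(A, t)
homogeneous = np.hstack([matches, np.ones((len(matches), 1))])
residuals = homogeneous @ f                     # all close to zero
```

Matches pooled over several patches would then be passed to a normalized 8-point solver; that step is omitted here.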

3.3 Estimating the Joint Epipole Directly from the Local Affine
Approximations

While the previous method was useful to recover the global projective F matrix,
it does not take full advantage of the tangency relationship of the local affine
joint images to the global projective joint image. Here we present a second
method that uses the fact that the tangent hyperplane defined by the local affine
patches to the 4D cone must pass through the vertex of the cone (see Section 2.5).
Thus, we can use the affine motion parameters (which specify the tangent plane)
to recover the epipoles directly, without recovering the fundamental matrix as
an intermediate step. Let a, t be the affine motion parameters of a given patch
and let f be the hyperplane normal vector given in Equation 12. Then in
the joint image space, the tangent $\Pi_{\mathbf{q}_0}$ to the patch is given by:

$$\Pi_{\mathbf{q}_0} = \mathbf{f} = (t_1 a_{21} - t_2 a_{11},\; t_1 a_{22} - t_2 a_{12},\; t_2,\; -t_1,\; t_1 \tilde{a}_{23} - t_2 \tilde{a}_{13})^T \qquad (26)$$

where $\tilde{a}_{13} = ((a_{11} - 1) x_0 + a_{12} y_0 + a_{13})$ and $\tilde{a}_{23} = (a_{21} x_0 + (a_{22} - 1) y_0 + a_{23})$ are
modified to account for the relative position of the patch in the global coordinate
system of the image, and $(x_0, y_0)$ is the upper-left corner of the patch.

Let $\mathbf{q}_e = (e_1\; e_2\; e'_1\; e'_2\; 1)^T$ be the joint epipole composed from the two epipoles.
Then we have (refer to Equation 20):

$$\Pi_{\mathbf{q}_0}^T \mathbf{q}_e = 0 \qquad (27)$$

This equation is true for every patch in the image, thus given several patches we
can recover the joint epipole $\mathbf{q}_e$ by finding the null space of the matrix whose
rows are the tangents of the individual patches:

$$\begin{pmatrix} \Pi_{\mathbf{q}_0^1}^T \\ \vdots \\ \Pi_{\mathbf{q}_0^n}^T \end{pmatrix} \mathbf{q}_e = 0 \qquad (28)$$
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Once the epipole e' is known we can recover the homography using the fol- 
lowing equation: 

/ 0 1 e'2 \ 

p''^Fp = p'^ 1 0 -e'l Hp = 0 (29) 

V-4e'i 0 J 

In this equation the epipole (ei,e 2 ,l) is known from the step just described 
above. Given p = and p' = are hallucinated matching 

points that are sampled as in the method described in Section El the only 
unknown is the homography H. This equation defines a homogeneous linear 
constraint, 

s'^h = 0 

in the 9 unknown parameters of the homography H. Here s is a 9 x 1 vector
which depends on the point, and h is the vector of homography parameters stacked as
a 9-dimensional vector. Every hallucinated matching point provides one such
constraint. Given a set of N hallucinated matching points indexed by i, the
vector h must lie in the null space of the matrix formed by stacking the vectors
$s_i^T$ into an N x 9 matrix.

Note that the null space of this equation is a 4-dimensional space as described 
in HD, since any planar homography consistent with the given camera geometry 
and an arbitrary physical plane in 3D will satisfy these equations. Hence, we are 
free to choose any linear combination of the null vectors to form a legitimate 
homography matrix H. 
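The linear system above can be sketched numerically. The following Python sketch assumes NumPy; the function names `skew` and `homography_null_space` are illustrative and do not come from the paper. It builds one row per hallucinated match from the constraint $p'^T [e']_\times H p = 0$ and returns a basis of the null space, which for exact, generic data has four elements.

```python
import numpy as np


def skew(e):
    """Cross-product matrix [e]_x, so that skew(e) @ v == np.cross(e, v)."""
    return np.array([[0.0, -e[2], e[1]],
                     [e[2], 0.0, -e[0]],
                     [-e[1], e[0], 0.0]])


def homography_null_space(e_prime, pts, pts_prime, rel_tol=1e-9):
    """Basis of all 3x3 matrices H with p'^T [e']_x H p = 0 for every match.

    pts, pts_prime: (N, 3) arrays of homogeneous image points.
    Each row of the system is s_i = kron([e']_x^T p'_i, p_i); it acts on h,
    the row-major stacking of H.
    """
    E = skew(np.asarray(e_prime, dtype=float))
    S = np.stack([np.kron(E.T @ pp, p) for p, pp in zip(pts, pts_prime)])
    _, sing, vt = np.linalg.svd(S)
    # Numerical rank: singular values above a relative threshold.
    rank = int(np.sum(sing > rel_tol * sing[0]))
    return vt[rank:].reshape(-1, 3, 3)
```

With noisy measurements the relative threshold is not reliable; one would instead keep the four right singular vectors with the smallest singular values.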

4 Experiments 

We performed a number of experiments on real images. In all the cases we used 
the progressive scan Canon ELURA DV camcorder that produces RGB images
of size 720 x 480 pixels. In all the experiments we used the “direct multi-frame” 
estimation technique to recover the affine model parameters of a local patch 
across multiple images. We collected several such patches and used them as 
tangents to compute either the fundamental matrix or the epipole directly. 



4.1 Recovering the Fundamental Matrix 

This experiment consists of 6 images. We manually selected 5 patches in the 
first image and recovered the affine motion parameters for each patch for all 
the images in the sequence. We then hallucinated matching points between the 
first and last images and used them to compute the fundamental matrix. To 
measure the quality of our result, we have numerically measured the distance 
of the hallucinated points to the epipolar lines generated by the fundamental 
matrix and found it to be about 1 pixel. The results can be seen in Fig. 2.
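The error measure used here, the distance from a point to its epipolar line, can be sketched as follows. This is a minimal Python sketch assuming NumPy, points given as homogeneous rows with last coordinate 1, and the convention $p'^T F p = 0$; the function name is illustrative.

```python
import numpy as np


def epipolar_line_distances(F, pts, pts_prime):
    """Distance (in pixels) from each p' to the epipolar line l' = F p.

    pts, pts_prime: (N, 3) arrays of homogeneous points with last entry 1.
    """
    pts = np.asarray(pts, dtype=float)
    pts_prime = np.asarray(pts_prime, dtype=float)
    lines = pts @ F.T                       # row i is F @ pts[i]
    numer = np.abs(np.sum(lines * pts_prime, axis=1))
    denom = np.hypot(lines[:, 0], lines[:, 1])
    return numer / denom
```

Averaging the returned distances over the test points gives a single figure comparable to the "about 1 pixel" reported above.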



4.2 Recovering the Joint Epipole 

We conducted two experiments, one consisting of 6 images, the other consisting 
of 8 images. We manually selected a number of patches in the first image and 
Fig. 2. First (a) and last (b) images in a 6-frame sequence. The rectangles represent
the manually selected patches. A “direct multi-frame” algorithm was used to estimate
the affine motion parameters of every patch throughout the sequence. “Hallucinated”
matching points were used to compute the fundamental matrix between the reference 
and last image. We show the quality of the fundamental matrix on a number of hand- 
selected points. Note that the epipolar lines pass near the matching points with an 
error of about 1 pixel. 



recovered the affine motion parameters for each patch for all the images in the
sequence. The patches were used to recover the joint epipole. From the joint
epipole we obtained the epipole and used it to recover the 4-dimensional space of
all possible homographies. We randomly selected a homography from this space,
and together with the epipole, computed the fundamental matrix between the
first and last images. The fundamental matrix was only used to generate epipolar
lines to visualize the result. The results can be seen in Figs. 3 and 4.

5 Discussion and Summary 

We have shown that the fundamental matrix can be viewed as a point-cone in 
the 4D joint image space. The cone can be recovered from its tangents, that are 
formed by taking the affine (or “para-perspective”) approximation at multiple 
patches in the 2D images. In fact, the tangency relationship between affine and
projective joint images extends to multiple images. These observations lead to
a novel algorithm that combines the results of multi-view affine recovery of
multiple local image patches to recover the global perspective epipolar geometry.
This leads to a novel method for recovering the epipoles directly from the affine 
patches, without recovering the fundamental matrix as an intermediate step. 

As mentioned earlier, our work generalizes that of Rieger & Lawton and
Lawn & Cipolla. Rieger & Lawton used the observation that the difference
in image flow of points (namely, parallax) on two sides of a depth discontinuity is
only affected by camera translation, and hence points to the focus-of-expansion
(FOE). They use multiple such parallax vectors to recover the FOE. Their
approach has been shown to be consistent with human psychological evidence.
Fig. 3. First (a) and last (b) images in a 6-frame sequence. The rectangles represent the
manually selected patches. A “direct multi-frame” algorithm was used to estimate the 
affine motion parameters of every patch throughout the sequence. The affine parameters 
are used to recover the epipoles directly. The fundamental matrix is recovered from the 
epipoles and an arbitrary legitimate homography only to visualize the epipolar lines. 
We show the quality of the fundamental matrix on a number of hand-selected points. 
Note that the epipolar lines pass near the matching points. 



As mentioned earlier, Lawn and Cipolla use the affine approximation for 
two-frame motion within local regions. However, they do not require that the 
region contain discontinuities. Our algorithm generalizes their approach by using 
multiple views simultaneously. The use of multiple views allows the use of the 
(local) rigidity constraint over all the views. We expect that this will increase 
the robustness of the affine structure from motion recovery. This generalization 
comes naturally as a result of treating the problem in the joint-image space. 
In particular, the identification of the tangency relationship between the affine 
and projective cases and realization that the two-view projective joint image is 
a cone are the key contributions of the paper that enable this generalization. 
Abstract. The critical configurations for projective reconstruction from
three views are discussed. A set of cameras and points is said to be critical
if the projected image points are insufficient to determine the placement
of the points and cameras uniquely, up to projective transformation. For
two views, the classification of critical configurations is well known - 
the configuration is critical if and only if the points and camera centres 
all lie on a ruled quadric. For three views the critical configurations 
have not been identified previously. In this paper it is shown that for 
any placement of three given cameras there always exists a critical set 
consisting of a fourth-degree curve - any number of points on the curve 
form a critical set for the three cameras. Dual to this result, for a set of 
seven points there exists a fourth-degree curve such that a configuration 
of any number of cameras placed on this curve is critical for the set of 
points. Other critical configurations exist in cases where the points all 
lie in a plane, or one of the cameras lies on a twisted cubic. 



1 Introduction 

The critical configurations for one and two views of a set of points are well
understood. For one view the critical sets consist of either a twisted cubic, or a plane
plus a line¹. Camera position can not be determined from the image projections
if and only if the camera and the points lie in one of these configurations. This
is a classic result reintroduced by Buchanan.

For two views the critical configuration consists of a ruled quadric, that is,
a hyperboloid of one sheet, or one of its degenerate versions. Any configuration
consisting of two cameras and any number of points lying on the ruled quadric
is critical. An interesting dual result proved by Maybank and Shashua
is that a configuration of six points and any number of cameras lying on a
ruled quadric is critical. This result, though originally proved using sophisticated
geometric techniques, was subsequently shown to follow easily from the two-view
critical configuration result using Carlsson duality.

¹ Configurations consisting of degenerate forms of a twisted cubic also exist.

No paper analyzing the three-view critical configurations has previously been
published. An unpublished paper by Shashua and Maybank addressed this
problem but did not identify any critical configurations other than ones
consisting of isolated points. In this paper it is shown that various critical
configurations exist for three views. Different types of critical surface exist, in particular:



1. A fourth-degree curve, the intersection of two quadric surfaces. If the cameras 
and points lie on this curve, then the configuration is critical. 

2. A set of points all lying on a plane and any three cameras lying off the plane. 

3. A configuration consisting of points lying on a twisted cubic and at least one 
of the three cameras also lying on the twisted cubic. 

No attempt is made in this paper to determine if this is an exhaustive list of
critical surfaces for three views, though this would not be unlikely.

Application of duality to the first of these cases generates a critical curve for 
any number of views of seven points. If all cameras lie along a specific fourth- 
degree curve, the intersection of two ruled quadrics, then the configuration is 
critical. 

Although critical configurations exist for three views, they are much less
common than for two views, and most importantly the critical configurations
are of low dimension, being the intersection of quadric surfaces, whereas in the
two-view case the critical surface has codimension one. In addition, in the
two-view case there is much more freedom in finding critical surfaces. One can go
as far as to specify two separate pairs of cameras $(\mathtt{P}, \mathtt{P}')$ and $(\mathtt{Q}, \mathtt{Q}')$ up front.
There will always exist a ruled quadric critical surface for which two projective
reconstructions exist, with cameras $(\mathtt{P}, \mathtt{P}')$ in the one reconstruction, and
cameras $(\mathtt{Q}, \mathtt{Q}')$ in the other. In the three-view case this is not true. If two camera
triples $(\mathtt{P}, \mathtt{P}', \mathtt{P}'')$ and $(\mathtt{Q}, \mathtt{Q}', \mathtt{Q}'')$ are specified in advance, then the critical set on
which one can not distinguish between them consists of the intersection of three
quadrics, generally consisting of at most eight points.

Notation. In this paper, the camera matrices are represented by $\mathtt{P}$ and $\mathtt{Q}$, 3D
points by $\mathbf{P}$ and $\mathbf{Q}$, and corresponding 2D points by $\mathbf{p} = \mathtt{P}\mathbf{P}$ or $\mathbf{q} = \mathtt{Q}\mathbf{Q}$. Thus
cameras and 3D points are distinguished only by their typeface. This may appear
to be a little confusing, but the alternative of using subscripts or primes proved
to be much more confusing. In the context of ambiguous reconstructions from
image coordinates we distinguish the two reconstructions by using $\mathtt{P}$ and $\mathbf{P}$ for
one, and $\mathtt{Q}$ and $\mathbf{Q}$ for the other.

2 Definitions 

We begin by defining the concept of critical configurations of points and cameras. 
These are essentially those configurations for which a unique projective recon- 
struction is not possible. The following definitions will be given for the two- view 
case, but the extension to three views is immediate. In fact it is the three-view 
case that we will mainly be interested in in this paper, but we will need the 
two- view case as well. 
A configuration of points and cameras is a triple² $\{\mathtt{P}, \mathtt{P}', \mathbf{P}_i\}$ where $\mathtt{P}$ and $\mathtt{P}'$ are
camera matrices and $\mathbf{P}_i$ are a set of 3D points. Such a configuration is called a
critical configuration if there exists another inequivalent configuration $\{\mathtt{Q}, \mathtt{Q}', \mathbf{Q}_i\}$
such that $\mathtt{P}\mathbf{P}_i = \mathtt{Q}\mathbf{Q}_i$ and $\mathtt{P}'\mathbf{P}_i = \mathtt{Q}'\mathbf{Q}_i$ for all i.

Unspecified in the last paragraph was what is meant by equivalent. One
would like to define two configurations as being equivalent if they are related via
a projective transformation, that is, there exists a 3D projective transformation
H such that $\mathtt{P} = \mathtt{Q}H^{-1}$ and $\mathtt{P}' = \mathtt{Q}'H^{-1}$ and $\mathbf{P}_i = H\mathbf{Q}_i$ for all i. Because of a
technicality, this definition of equivalence is not quite appropriate to the present
discussion. This is because from image correspondences one can not determine
the position of a point lying on the line joining the two camera centres. Hence,
non-projectively-equivalent reconstructions will always exist if some points lie
on the line of camera centres. (Points not on the line of the camera centres are
of course uniquely determined by their images with respect to a pair of known
cameras.) This type of reconstruction ambiguity is not of great interest, and
so we will modify the notion of equivalence by defining two reconstructions to
be equivalent if H exists such that $\mathtt{P} = \mathtt{Q}H^{-1}$ and $\mathtt{P}' = \mathtt{Q}'H^{-1}$. Assuming that
$\mathtt{P}\mathbf{P}_i = \mathtt{Q}\mathbf{Q}_i$ and $\mathtt{P}'\mathbf{P}_i = \mathtt{Q}'\mathbf{Q}_i$, such an H will also map $\mathbf{Q}_i$ to $\mathbf{P}_i$, except possibly
for reconstructed points $\mathbf{P}_i$ lying on the line of the camera centres. This condition
is also equivalent to the condition that $F_P = F_Q$ (up to scale of course), where $F_P$
and $F_Q$ are the fundamental matrices corresponding to the camera pairs $(\mathtt{P}, \mathtt{P}')$
and $(\mathtt{Q}, \mathtt{Q}')$.

Thus, a critical configuration is one in which one can not reconstruct the 
cameras uniquely from the image correspondences derived from the 3D points 
- there will exist an alternative inequivalent configuration that gives rise to 
the same image correspondences. The alternative configuration will be called a 
conjugate configuration. 

We now show the important result that the property of being a critical con- 
figuration does not depend on any property of the camera matrices involved, 
other than their two camera centres. The following remark is well known and 
easily proved, so we omit the proof. 

Proposition 1. Let $\mathtt{P}$ and $\mathtt{P}'$ be two camera matrices with the same centre. Then
there exists a 2D projective image transformation represented by a non-singular
matrix H such that $\mathtt{P}' = H\mathtt{P}$. Conversely, for any such matrix H, two cameras $\mathtt{P}$
and $\mathtt{P}' = H\mathtt{P}$ have the same centre.
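The converse direction of Proposition 1 is easy to check numerically. The sketch below assumes NumPy and takes the camera centre to be the null vector of the 3 x 4 camera matrix; the function names are illustrative only.

```python
import numpy as np


def camera_centre(P):
    """Unit-norm homogeneous centre C of a 3x4 camera, i.e. P @ C = 0."""
    return np.linalg.svd(P)[2][-1]


def same_centre(P1, P2, tol=1e-9):
    """True if the two cameras have the same centre (up to scale and sign)."""
    c1, c2 = camera_centre(P1), camera_centre(P2)
    return abs(abs(c1 @ c2) - 1.0) < tol
```

For any non-singular 3 x 3 matrix H, `same_centre(P, H @ P)` should hold, since H P C = 0 whenever P C = 0.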

This proposition may be interpreted as saying that an image is determined 
up to projectivity by the camera centre alone. It has the following consequence. 

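Proposition 2. If $\{\mathtt{P}, \mathtt{P}', \mathbf{P}_i\}$ is a critical configuration and $\hat{\mathtt{P}}$ and $\hat{\mathtt{P}}'$ are two
cameras with the same centres as $\mathtt{P}$ and $\mathtt{P}'$ respectively, then $\{\hat{\mathtt{P}}, \hat{\mathtt{P}}', \mathbf{P}_i\}$ is a
critical configuration as well.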

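Proof. This is easily seen as follows. Since $\{\mathtt{P}, \mathtt{P}', \mathbf{P}_i\}$ is a critical configuration
there exists an alternative configuration $\{\mathtt{Q}, \mathtt{Q}', \mathbf{Q}_i\}$ such that $\mathtt{P}\mathbf{P}_i = \mathtt{Q}\mathbf{Q}_i$ and
$\mathtt{P}'\mathbf{P}_i = \mathtt{Q}'\mathbf{Q}_i$ for all i. However, since $\mathtt{P}$ and $\hat{\mathtt{P}}$ have the same camera centre,
$\hat{\mathtt{P}} = H\mathtt{P}$ according to Proposition 1 and similarly $\hat{\mathtt{P}}' = H'\mathtt{P}'$. Therefore

$\hat{\mathtt{P}}\mathbf{P}_i = H\mathtt{P}\mathbf{P}_i = H\mathtt{Q}\mathbf{Q}_i$ and
$\hat{\mathtt{P}}'\mathbf{P}_i = H'\mathtt{P}'\mathbf{P}_i = H'\mathtt{Q}'\mathbf{Q}_i$.

It follows that $\{H\mathtt{Q}, H'\mathtt{Q}', \mathbf{Q}_i\}$ is an alternative configuration to $\{\hat{\mathtt{P}}, \hat{\mathtt{P}}', \mathbf{P}_i\}$, which
is therefore critical.

² In the three-view case, there will be an extra camera $\mathtt{P}''$ of course.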



3 Two View Ambiguity 

The critical configurations for two-view reconstruction are well known: a
configuration is critical if and only if the points and the two camera centres all lie on
a ruled quadric (in the non-degenerate case, a hyperboloid of one sheet). What
is perhaps not so well appreciated is that one may choose both pairs of camera
matrices in advance and find a critical surface.

It is customary to represent a quadric by a symmetric matrix S. A point $\mathbf{P}$
will lie on the quadric if and only if $\mathbf{P}^T S \mathbf{P} = 0$. However, notice that it is not
essential that the matrix S be symmetric for this to make sense. In the rest of
this paper quadrics will commonly be represented by non-symmetric matrices.
Note that $\mathbf{P}^T S \mathbf{P} = 0$ if and only if $\mathbf{P}^T (S + S^T) \mathbf{P} = 0$. Thus, S and its symmetric
part $(S + S^T)/2$ represent the same quadric.
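A short numerical illustration of this remark, as a Python sketch assuming NumPy (the helper names are illustrative): the quadratic form of a non-symmetric matrix equals that of its symmetric part, because the skew-symmetric part contributes nothing.

```python
import numpy as np


def quadric_form(S, X):
    """Value of the quadratic form X^T S X for a 4x4 matrix S and 4-vector X."""
    return float(X @ S @ X)


def symmetric_part(S):
    """Symmetric part (S + S^T) / 2, which defines the same quadric as S."""
    return 0.5 * (S + S.T)
```

Evaluating `quadric_form(S, X)` and `quadric_form(symmetric_part(S), X)` on any vector X should give the same value up to rounding.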

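Lemma 1. Consider two pairs of cameras $(\mathtt{P}, \mathtt{P}')$ and $(\mathtt{Q}, \mathtt{Q}')$, with corresponding
fundamental matrices $F_{P'P}$ and $F_{Q'Q}$. Define a quadric $S_P = \mathtt{P}'^T F_{Q'Q} \mathtt{P}$, and
$S_Q = \mathtt{Q}'^T F_{P'P} \mathtt{Q}$.

1. The quadric $S_P$ contains the camera centres of $\mathtt{P}$ and $\mathtt{P}'$. Similarly, $S_Q$ contains
the camera centres of $\mathtt{Q}$ and $\mathtt{Q}'$.

2. If $\mathbf{P}$ and $\mathbf{Q}$ are 3D points such that $\mathtt{P}\mathbf{P} = \mathtt{Q}\mathbf{Q}$ and $\mathtt{P}'\mathbf{P} = \mathtt{Q}'\mathbf{Q}$, then $\mathbf{P}$ lies on
the quadric $S_P$, and $\mathbf{Q}$ lies on $S_Q$.

3. Conversely, if $\mathbf{P}$ is a point lying on the quadric $S_P$, then there exists a point
$\mathbf{Q}$ lying on $S_Q$ such that $\mathtt{P}\mathbf{P} = \mathtt{Q}\mathbf{Q}$ and $\mathtt{P}'\mathbf{P} = \mathtt{Q}'\mathbf{Q}$.

4. If $\mathbf{e}_Q$ is the epipole defined by $F_{Q'Q}\mathbf{e}_Q = 0$, then the ray passing through $C_P$
consisting of points $\mathbf{P}$ such that $\mathbf{e}_Q = \mathtt{P}\mathbf{P}$ lies on the quadric $S_P$.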

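Proof. The matrix $F_{P'P}$ corresponding to a pair of cameras $(\mathtt{P}, \mathtt{P}')$ is characterized
by the fact that $\mathtt{P}'^T F_{P'P} \mathtt{P}$ is skew-symmetric. Since $F_{P'P} \neq F_{Q'Q}$, however, the
matrices $S_P$ and $S_Q$ defined here are not skew-symmetric, and hence represent
well-defined quadrics.

We denote the centre of a camera with matrix such as $\mathtt{P}$ by $C_P$. Then

1. The camera centre of $\mathtt{P}$ satisfies $\mathtt{P}C_P = 0$. Then $C_P^T S_P C_P = C_P^T (\mathtt{P}'^T F_{Q'Q} \mathtt{P}) C_P
= C_P^T (\mathtt{P}'^T F_{Q'Q}) \mathtt{P} C_P = 0$, since $\mathtt{P}C_P = 0$. So, $C_P$ lies on the quadric $S_P$. In a
similar manner, $C_{P'}$ lies on $S_P$.

2. Under the given conditions one sees that

$$\mathbf{P}^T S_P \mathbf{P} = \mathbf{P}^T \mathtt{P}'^T F_{Q'Q} \mathtt{P} \mathbf{P} = \mathbf{Q}^T (\mathtt{Q}'^T F_{Q'Q} \mathtt{Q}) \mathbf{Q} = 0$$

since $\mathtt{Q}'^T F_{Q'Q} \mathtt{Q}$ is skew-symmetric. Thus, $\mathbf{P}$ lies on the quadric $S_P$. By a similar
argument, $\mathbf{Q}$ lies on $S_Q$.

3. Let $\mathbf{P}$ lie on $S_P$ and define $\mathbf{p} = \mathtt{P}\mathbf{P}$ and $\mathbf{p}' = \mathtt{P}'\mathbf{P}$. Then, from $\mathbf{P}^T S_P \mathbf{P} = 0$ we
deduce $0 = \mathbf{P}^T \mathtt{P}'^T F_{Q'Q} \mathtt{P} \mathbf{P} = \mathbf{p}'^T F_{Q'Q} \mathbf{p}$, and so $\mathbf{p}' \leftrightarrow \mathbf{p}$ are a corresponding
pair of points with respect to $F_{Q'Q}$. Therefore, there exists a point $\mathbf{Q}$ such
that $\mathtt{Q}\mathbf{Q} = \mathbf{p} = \mathtt{P}\mathbf{P}$, and $\mathtt{Q}'\mathbf{Q} = \mathbf{p}' = \mathtt{P}'\mathbf{P}$. From part 2 of this lemma, $\mathbf{Q}$ must
lie on $S_Q$.

4. For a point $\mathbf{P}$ such that $\mathbf{e}_Q = \mathtt{P}\mathbf{P}$ one verifies that $S_P \mathbf{P} = \mathtt{P}'^T F_{Q'Q} \mathtt{P} \mathbf{P} =
\mathtt{P}'^T F_{Q'Q} \mathbf{e}_Q = 0$, so $\mathbf{P}$ lies on $S_P$.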

This lemma completely describes the sets of 3D points giving rise to ambi- 
guous image correspondences. Note that any two arbitrarily chosen camera pairs 
can give rise to ambiguous image correspondences, provided that the world points 
lie on the given quadrics. The quadric Sp is a ruled quadric, since it contains a 
ray. 
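Parts 1 and 4 of the lemma can be checked numerically on arbitrary cameras. The following Python sketch assumes NumPy and builds the fundamental matrix with the standard construction $F = [\mathbf{e}']_\times \mathtt{P}' \mathtt{P}^+$, which is a choice made here for illustration and is not taken from the paper; all function names are hypothetical.

```python
import numpy as np


def skew(v):
    """Cross-product matrix [v]_x."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])


def camera_centre(P):
    """Unit null vector C of a 3x4 camera matrix, P @ C = 0."""
    return np.linalg.svd(P)[2][-1]


def fundamental_matrix(P, P_prime):
    """F with p'^T F p = 0 for p = P X and p' = P' X."""
    e_prime = P_prime @ camera_centre(P)      # epipole in the second image
    return skew(e_prime) @ P_prime @ np.linalg.pinv(P)


def critical_quadric(P, P_prime, F_other):
    """S_P = P'^T F_{Q'Q} P from Lemma 1 (generally non-symmetric)."""
    return P_prime.T @ F_other @ P
```

Given two random camera pairs, one can form `S_P = critical_quadric(P, P_prime, fundamental_matrix(Q, Q_prime))` and confirm that both camera centres of the first pair give a zero quadratic form, and that `S_P @ X` vanishes for points X on the ray of part 4.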



4 Three View Critical Surfaces 

We now turn to the main subject of this paper - the ambiguous configurations
that may arise in the three-view case. To distinguish the three cameras, we use
superscripts instead of primes. Thus, let $\mathtt{P}^0, \mathtt{P}^1, \mathtt{P}^2$ be three cameras and $\{\mathbf{P}_i\}$ be
a set of points. One asks under what circumstances there exists another
configuration consisting of three other camera matrices $\mathtt{Q}^0$, $\mathtt{Q}^1$ and $\mathtt{Q}^2$ and points $\{\mathbf{Q}_i\}$
such that $\mathtt{P}^j \mathbf{P}_i = \mathtt{Q}^j \mathbf{Q}_i$ for all i and j. One requires that the two configurations
be projectively inequivalent.

Various special ambiguous configurations exist. 



Points in a Plane 

If all the points lie in a plane, and $\mathbf{P}_i = \mathbf{Q}_i$ for all i, then one may move any of
the cameras without changing the projective equivalence class of the projected
points. Then one may choose $\mathtt{P}^j$ and $\mathtt{Q}^j$ with centres at any two preassigned
locations in such a way that $\mathtt{P}^j \mathbf{P}_i = \mathtt{Q}^j \mathbf{Q}_i$. This ambiguity has also been observed
elsewhere.



Points on a Twisted Cubic 

One has a similar ambiguous situation when all the points plus one of the
cameras, say $\mathtt{P}^2$, lie on a twisted cubic. In this case, one may choose $\mathtt{Q}^0 = \mathtt{P}^0$,
$\mathtt{Q}^1 = \mathtt{P}^1$ and the points $\mathbf{Q}_i = \mathbf{P}_i$ for all i. Then according to the well known
ambiguity of camera resectioning for points on a twisted cubic, for any point
$C_Q^2$ on the twisted cubic, one may choose a camera matrix $\mathtt{Q}^2$ with centre at $C_Q^2$
such that $\mathtt{P}^2 \mathbf{P}_i = \mathtt{Q}^2 \mathbf{Q}_i$ for all i.

These examples of ambiguity are not very interesting, since they are no more 
than extensions of the 1-view camera resectioning ambiguity. In the above ex- 
amples, the points Pi and Qi are the same in each case, and the ambiguity lies 
only in the placement of the cameras with respect to the points. More interesting 
ambiguities may also occur, as we consider next. 



General 3-View Ambiguity 

Suppose that the camera matrices $(\mathtt{P}^0, \mathtt{P}^1, \mathtt{P}^2)$ and $(\mathtt{Q}^0, \mathtt{Q}^1, \mathtt{Q}^2)$ are fixed, and we
wish to find the set of all points such that $\mathtt{P}^i \mathbf{P} = \mathtt{Q}^i \mathbf{Q}$ for i = 0, 1, 2. Note that
we are trying here to copy the 2-view case in which both sets of camera matrices
are chosen up front. Later, we will turn to the less restricted case in which just
one set of cameras is chosen in advance.

A simple observation is that a critical configuration for three views is also
a critical set for each of the pairs of views as well. Thus one is led naturally to
assume that the set of points for which $\{\mathtt{P}^0, \mathtt{P}^1, \mathtt{P}^2, \mathbf{P}_i\}$ is a critical configuration is
simply the intersection of the point sets for which each of $\{\mathtt{P}^0, \mathtt{P}^1, \mathbf{P}_i\}$, $\{\mathtt{P}^1, \mathtt{P}^2, \mathbf{P}_i\}$
and $\{\mathtt{P}^0, \mathtt{P}^2, \mathbf{P}_i\}$ are critical configurations. Since by lemma 1 each of these point
sets is a ruled quadric, one is led to assume that the critical point set in the
3-view case is simply an intersection of three quadrics. Although this is not far
from the truth, the reasoning is somewhat fuzzy. The crucial point missing in
this argument is that the corresponding conjugate points may not be the same for
each of the three pairs.

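More precisely, corresponding to the critical configuration $\{\mathtt{P}^0, \mathtt{P}^1, \mathbf{P}_i\}$, there
exists a conjugate configuration $\{\mathtt{Q}^0, \mathtt{Q}^1, \mathbf{Q}_i^{01}\}$ for which $\mathtt{P}^j \mathbf{P}_i = \mathtt{Q}^j \mathbf{Q}_i^{01}$ for j =
0, 1. Similarly, for the critical configuration $\{\mathtt{P}^0, \mathtt{P}^2, \mathbf{P}_i\}$, there exists a conjugate
configuration $\{\mathtt{Q}^0, \mathtt{Q}^2, \mathbf{Q}_i^{02}\}$ for which $\mathtt{P}^j \mathbf{P}_i = \mathtt{Q}^j \mathbf{Q}_i^{02}$ for j = 0, 2. However, the
points $\mathbf{Q}_i^{02}$ are not necessarily the same as $\mathbf{Q}_i^{01}$, so we can not conclude that
there exist points $\mathbf{Q}_i$ such that $\mathtt{P}^j \mathbf{P}_i = \mathtt{Q}^j \mathbf{Q}_i$ for all i and j = 0, 1, 2 - at least
not immediately.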

We now consider this a little more closely. Considering just the first pairs of
cameras $(\mathtt{P}^0, \mathtt{P}^1)$ and $(\mathtt{Q}^0, \mathtt{Q}^1)$, lemma 1 tells us that if $\mathbf{P}$ and $\mathbf{Q}$ are points such
that $\mathtt{P}^j \mathbf{P} = \mathtt{Q}^j \mathbf{Q}$ for j = 0, 1, then $\mathbf{P}$ must lie on a quadric surface $S_P^{01}$ determined by these
camera matrices. Similarly, point $\mathbf{Q}$ lies on a quadric $S_Q^{01}$. Likewise considering
the camera pairs $(\mathtt{P}^0, \mathtt{P}^2)$ and $(\mathtt{Q}^0, \mathtt{Q}^2)$ one finds that the point $\mathbf{P}$ must lie on a
second quadric $S_P^{02}$ defined by these two camera pairs. Similarly, there exists a
further quadric $S_P^{12}$ defined by the camera pairs $(\mathtt{P}^1, \mathtt{P}^2)$ and $(\mathtt{Q}^1, \mathtt{Q}^2)$ on which the
point $\mathbf{P}$ must lie. Thus for points $\mathbf{P}$ and $\mathbf{Q}$ to exist such that $\mathtt{P}^i \mathbf{P} = \mathtt{Q}^i \mathbf{Q}$ for
i = 0, 1, 2 it is necessary that $\mathbf{P}$ lie on the intersection of the three quadrics:
$\mathbf{P} \in S_P^{01} \cap S_P^{02} \cap S_P^{12}$. It will now be seen that this is almost a necessary and
sufficient condition.³

³ A reviewer of this paper reports that parts of this theorem were known to Buchanan,
but I am unable to provide a reference.
Theorem 1. Let $(\mathtt{P}^0, \mathtt{P}^1, \mathtt{P}^2)$ and $(\mathtt{Q}^0, \mathtt{Q}^1, \mathtt{Q}^2)$ be two triplets of camera matrices
and assume $\mathtt{P}^0 = \mathtt{Q}^0$. For each of the pairs (i, j) = (0,1), (0,2) and (1,2), let
$S_P^{ij}$ and $S_Q^{ij}$ be the ruled quadric critical surfaces defined for camera matrix pairs
$(\mathtt{P}^i, \mathtt{P}^j)$ and $(\mathtt{Q}^i, \mathtt{Q}^j)$ as in lemma 1.

1. If there exist points $\mathbf{P}$ and $\mathbf{Q}$ such that $\mathtt{P}^i \mathbf{P} = \mathtt{Q}^i \mathbf{Q}$ for all i = 0, 1, 2, then $\mathbf{P}$
must lie on the intersection $S_P^{01} \cap S_P^{02} \cap S_P^{12}$ and $\mathbf{Q}$ must lie on $S_Q^{01} \cap S_Q^{02} \cap S_Q^{12}$.

2. Conversely, if $\mathbf{P}$ is a point lying on the intersection of quadrics $S_P^{01} \cap S_P^{02} \cap S_P^{12}$,
but not on a plane containing the three camera centres $C_Q^0$, $C_Q^1$ and $C_Q^2$, then
there exists a point $\mathbf{Q}$ lying on $S_Q^{01} \cap S_Q^{02} \cap S_Q^{12}$ such that $\mathtt{P}^i \mathbf{P} = \mathtt{Q}^i \mathbf{Q}$ for all
i = 0, 1, 2.

Note that the condition that $\mathtt{P}^0 = \mathtt{Q}^0$ is not any restriction of generality, since
the projective frames for the two configurations $(\mathtt{P}^0, \mathtt{P}^1, \mathtt{P}^2)$ and $(\mathtt{Q}^0, \mathtt{Q}^1, \mathtt{Q}^2)$ are
independent. One may easily choose a projective frame for the second
configuration in which this condition is true. This assumption is made simply so that
one may consider the point $\mathbf{P}$ in relation to the projective frame of the second
set of cameras.

The extra condition that the point $\mathbf{P}$ not lie on the plane of camera centres
$C_Q^i$ is necessary, as will be seen later. Note that in most cases this case will not
arise, however, since the intersection of the three quadrics with the trifocal
plane will be empty, or in special cases consist of a finite number of points.

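Proof. For the first part, the fact that the points $\mathbf{P}$ and $\mathbf{Q}$ lie on the intersections
of the three quadrics follows (as pointed out before the statement of the theorem)
from lemma 1 applied to each pair of cameras in turn.

To prove the converse, suppose that $\mathbf{P}$ lies on the intersection of the three
quadrics. Then from lemma 1, applied to each of the three quadrics $S_P^{ij}$, there
exist points $\mathbf{Q}^{ij}$ such that the following conditions hold:

$$\mathtt{P}^0\mathbf{P} = \mathtt{Q}^0\mathbf{Q}^{01} \;;\quad \mathtt{P}^1\mathbf{P} = \mathtt{Q}^1\mathbf{Q}^{01}$$
$$\mathtt{P}^0\mathbf{P} = \mathtt{Q}^0\mathbf{Q}^{02} \;;\quad \mathtt{P}^2\mathbf{P} = \mathtt{Q}^2\mathbf{Q}^{02}$$
$$\mathtt{P}^1\mathbf{P} = \mathtt{Q}^1\mathbf{Q}^{12} \;;\quad \mathtt{P}^2\mathbf{P} = \mathtt{Q}^2\mathbf{Q}^{12}$$

It is easy to be confused by the superscripts here, but the main point is that
each line is precisely the result of lemma 1 applied to one of the three pairs of
camera matrices at a time. Now, these equations may be rearranged as

$$\mathtt{P}^0\mathbf{P} = \mathtt{Q}^0\mathbf{Q}^{01} = \mathtt{Q}^0\mathbf{Q}^{02}$$
$$\mathtt{P}^1\mathbf{P} = \mathtt{Q}^1\mathbf{Q}^{01} = \mathtt{Q}^1\mathbf{Q}^{12}$$
$$\mathtt{P}^2\mathbf{P} = \mathtt{Q}^2\mathbf{Q}^{02} = \mathtt{Q}^2\mathbf{Q}^{12}$$

Now, the condition that $\mathtt{Q}^1\mathbf{Q}^{01} = \mathtt{Q}^1\mathbf{Q}^{12}$ means that the points $\mathbf{Q}^{01}$ and $\mathbf{Q}^{12}$
are collinear with the camera centre $C_Q^1$ of $\mathtt{Q}^1$. Thus, assuming that the points
$\mathbf{Q}^{ij}$ are distinct, they must lie in a configuration as shown in Fig. 1. One sees
from the diagram that if two of the points are the same, then the third one is
the same as the other two. If the three points are distinct, then the three points
$\mathbf{Q}^{ij}$ and the three camera centres $C_Q^i$ are coplanar, since they all lie in the plane
defined by one of the points and the line joining the other two. Thus the three points all lie in
the plane of the camera centres $C_Q^i$. However, since $\mathtt{P}^0\mathbf{P} = \mathtt{Q}^0\mathbf{Q}^{01} = \mathtt{Q}^0\mathbf{Q}^{02}$ and
$\mathtt{P}^0 = \mathtt{Q}^0$, it follows that $\mathbf{P}$ must lie along the same line as $\mathbf{Q}^{01}$ and $\mathbf{Q}^{02}$, and hence
must lie in the same plane as the camera centres $C_Q^i$.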




Fig. 1. Configuration of the three camera centres and the three ambiguous points. If the
three points $\mathbf{Q}^{ij}$ are distinct, then they all lie in the plane of the camera centres $C_Q^i$.



In general, the intersection of three quadrics will consist of eight points. In
this case, the critical set with respect to the two triplets of camera matrices
will consist of these eight points alone. In some cases, however, the camera
matrices may be chosen such that the three quadric surfaces meet in a curve.
This will occur if the three quadrics $S_P^{ij}$ are linearly dependent. For instance if
$S_P^{12} = \alpha S_P^{01} + \beta S_P^{02}$, then any point $\mathbf{P}$ that satisfies $\mathbf{P}^T S_P^{01} \mathbf{P} = 0$ and $\mathbf{P}^T S_P^{02} \mathbf{P} = 0$
will also satisfy $\mathbf{P}^T S_P^{12} \mathbf{P} = 0$. Thus the intersection of the three quadrics is the
same as the intersection of two of them, which will in general be a fourth-degree
space curve.



An Example 

As a specific example of ambiguity, consider the following configuration. Let 



$$\mathtt{P}^0 = [I \mid 0], \qquad \mathtt{Q}^0 = [I \mid 0]$$

In this case, one may verify that





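$$F_Q^{10} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & -1 & 0 \end{pmatrix}, \quad F_Q^{20} = \begin{pmatrix} 0 & 0 & -1 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix}, \quad F_Q^{21} = \begin{pmatrix} 0 & 1 & -1 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix}$$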


and from lemma 1 one may compute that the quadric surfaces $S_P^{01} = S_P^{02}$ both
represent the quadric XY = Z, given by the matrix

$$\begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & -1 \\ 0 & 0 & -1 & 0 \end{pmatrix}$$

The intersection of this quadric with $S_P^{12}$ will be a curve. In fact, for any t, let
Y(t) = 1 — ± — and 

$$\mathbf{P}_t = (t, Y(t), tY(t), 1)^T$$

$$\mathbf{Q}_t = (t, Y(t), tY(t), (Y(t) - t)/(1 + t))^T . \qquad (1)$$

One may then verify that $\mathtt{P}^i \mathbf{P}_t = \mathtt{Q}^i \mathbf{Q}_t$ for all i and t. One also verifies that
all the three camera centres $C_P^0 = (0,0,0,1)^T$, $C_P^1 = (1,1,1,1)^T$ and $C_P^2 =
(-1,-1,1,1)^T$ lie on the curve $\mathbf{P}_t$.
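A quick numerical check of the quadric part of this claim, as a Python sketch assuming NumPy: the three camera centres, and any point of the form (t, Y, tY, 1), satisfy XY = ZW for the 4 x 4 quadric matrix of this example. The names `S_XY_Z`, `quadric_value` and `curve_point` are illustrative, and the value of Y is left as a free argument because only membership in the quadric is tested here, not in the second quadric that fixes Y(t).

```python
import numpy as np

# Symmetric matrix of the quadric XY = ZW (XY = Z in inhomogeneous form).
S_XY_Z = np.array([[0.0, 1.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, -1.0],
                   [0.0, 0.0, -1.0, 0.0]])

# Camera centres of the example, in homogeneous coordinates.
CENTRES = np.array([[0.0, 0.0, 0.0, 1.0],
                    [1.0, 1.0, 1.0, 1.0],
                    [-1.0, -1.0, 1.0, 1.0]])


def quadric_value(S, X):
    """X^T S X; zero exactly when X lies on the quadric."""
    return float(X @ S @ X)


def curve_point(t, y):
    """Point (t, y, t*y, 1)^T, which lies on XY = Z for any t and y."""
    return np.array([t, y, t * y, 1.0])
```

Calling `quadric_value(S_XY_Z, C)` for each row C of `CENTRES` should return zero.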

The method of discovering this example was to start with the camera
matrices $\mathtt{P}^i$, and then compute the required fundamental matrices $F_Q^{10}$ and $F_Q^{20}$
necessary to ensure that the quadrics $S_P^{01}$ and $S_P^{02}$ have the desired form. From the
fundamental matrices one then computes the matrices $\mathtt{Q}^i$ by standard means.

Note that this example may appear a little special, since two of the quadrics 
are equal. However, this case is only special, because we are choosing all six 
camera matrices in advance. Using this example, we are now able to describe a 
critical set for any configuration of three cameras. 

Theorem 2. Given three cameras $(\mathtt{P}^0, \mathtt{P}^1, \mathtt{P}^2)$ with non-collinear centres, there
exists (at least) a fourth-degree curve $\mathbf{P}_t$ formed as the intersection of two ruled
quadrics containing the three camera centres that can not be uniquely
reconstructed from projections from these three camera centres. In particular, there exist
three alternative cameras $\mathtt{Q}^i$ and another fourth-degree curve $\mathbf{Q}_t$ such that for all
i and t

$$\mathtt{P}^i \mathbf{P}_t = \mathtt{Q}^i \mathbf{Q}_t$$

and such that the two configurations $\{\mathtt{P}^0, \mathtt{P}^1, \mathtt{P}^2, \mathbf{P}_t\}$ and $\{\mathtt{Q}^0, \mathtt{Q}^1, \mathtt{Q}^2, \mathbf{Q}_t\}$ are not
projectively equivalent.

Proof. The proof is quite simple. Since the three camera centres are non-collinear
one may transform them by a projective transform if necessary to the three
camera centres $C_P^0 = (0,0,0,1)^T$, $C_P^1 = (1,1,1,1)^T$ and $C_P^2 = (-1,-1,1,1)^T$
of the foregoing example. Now, applying Proposition 2 we may assume that
the three cameras are identical with the three cameras $\mathtt{P}^i$ of the example. Now,
choosing $\mathtt{Q}^i$, $\mathbf{P}_t$ and $\mathbf{Q}_t$ as in the example gives the required reconstruction
ambiguity.
It is significant to note that the critical curve for the three specified cameras
in Theorem 2 is not unique even for fixed camera matrices - rather there exists a
6-parameter family of such curves, since any projective transformation that maps
the three camera centres to themselves will map the critical curve to another
critical curve. Summing up, given three fixed cameras $(\mathtt{P}^0, \mathtt{P}^1, \mathtt{P}^2)$, in total we
have identified the following critical configurations:

1. A six-parameter family of fourth-degree curves containing the three camera 
centres. 

2. Any plane not containing the camera centres. 

3. Any twisted cubic passing through one of the three camera centres. 

Can All Three Quadrics Be the Same? 

It is natural to ask whether it is possible to choose camera matrices so that all
three quadrics $S_P^{01}$, $S_P^{02}$ and $S_P^{12}$ are equal, and whether in this case this constitutes
a critical set for all three cameras. The answer to this question is yes and no:
it is possible to choose the camera matrices such that the three quadrics $S_P^{ij}$ are
the same, but this does not constitute a critical surface for the three cameras,
since the three quadrics $S_Q^{ij}$ are different. This seems to contradict Theorem 1,
but it in fact does not, as we shall see in the following discussion. We consider
only the case where the three camera centres for $\mathtt{P}^i$ are non-collinear.

Since all hyperboloids of one sheet are projectively equivalent, one can assume
that each $S_P^{ij}$ is the quadric XY = Z. Then there are sufficiently many remaining
degrees of freedom to allow us to assume that the three camera centres are at
$(0, 0, 0)^T$, $(1, 1, 1)^T$ and $(-1, -1, 1)^T$. (This is valid, unless two of the centres lie on
the same generator of the quadric.) We can therefore conclude that the camera
matrices $\mathtt{P}^i$ are the same as in the example above. Next, we wish to find the
fundamental matrices $F_Q^{ij}$. The constraint that $S_P^{ij}$ is the quadric XY = Z in each
case constrains the form of $F_Q^{ij}$, computed according to the formula $S_P^{ij} = \mathtt{P}^{iT} F_Q^{ij} \mathtt{P}^j$
given in lemma 1. One finds that there are only two possibilities for each $F_Q^{ij}$.
One possibility is





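$$F_Q^{10} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & -1 & 0 \end{pmatrix}, \quad F_Q^{20} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{pmatrix}, \quad F_Q^{21} = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & -1 & 0 \end{pmatrix} \qquad (2)$$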


The other possibility for each of the three fundamental matrices is obtained by 
simultaneously swapping the first two rows and the first two columns of each 
fundamental matrix. Thus there are two choices for each Fg'^, making a total of 
eight choices in all. However to be compatible the three fundamental matrices 
must satisfy coplanarity constraints. Specifically, denoting an epipole in the j-th 
view as eO, one requires that e^*^Fg-^eO = 0 for all choices of i,j,k = 1,2,3. 
This condition rules out all choices of Fg^ except for the ones in 0 and the set 
obtained by swapping the first two rows and columns of all three Fg'^ at once. 
This second choice of Fg-^ is substantially the same as the one in (0 , and hence 
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we may assume that the three fundamental matrices are as in Q). Now one 
observes that the epipoles and obtained as the right null-vectors of 
and Fq° are both the same, equal to (1, 0, 0)^. This means that the three camera 
Q* are collinear. This gives the curious result : 

- Suppose that $(P^0, P^1, P^2)$ and $(Q^0, Q^1, Q^2)$ are two triplets of
cameras for which the three critical quadric surfaces $S_P^{ij}$ are all equal.
If the centres of cameras $P^i$ are non-collinear, then the centres of $Q^i$ are
collinear.



Finally, from the three fundamental matrices one can reconstruct the three
camera matrices $Q^i$. Because the camera centres are collinear, there is not a
unique solution; the general solution (up to projectivity) is



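$$Q^0 = [\, I \mid 0 \,]; \quad
Q^1 = \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & -1 & 0 & 0 \\ 1 & 0 & -1 & 1 \end{bmatrix}; \quad
Q^2 = \begin{bmatrix} a & b & c & d \\ 0 & -1 & 0 & 0 \\ -a & -b & -c-1 & -d \end{bmatrix} \qquad (3)$$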



One can now compute the three quadrics $S_Q^{ij}$ explicitly using Lemma 1. One
finds that they are not the same. Thus, the three quadrics $S_P^{ij}$ are the
same, but the three quadrics $S_Q^{ij}$ are different, and so $S_P^{ij}$ is not
a critical surface for all three views. It follows from this that it is not
possible for the intersection of all three quadrics to form a critical surface
for all three views.

How is this to be reconciled with Theorem 2, which states (roughly) that the
critical point set is the intersection of the three quadrics $S_P^{ij}$? The
answer is in the exception concerning points that lie in the trifocal plane of
the three camera centres of $Q^i$. In the present case the centres of the three
cameras $Q^i$ are collinear, so any point $\mathbf{P}$ lies in a common plane
with the three camera centres and we are unable to conclude from Theorem 2 that
there exists a point $\mathbf{Q}$ such that $P^i\mathbf{P} = Q^i\mathbf{Q}$.
There are actually three points as in Fig. [?].



More about Theorem 2

The second part of Theorem 2 is useful only in the case where the three camera
centres of the second set of cameras, $Q^i$, are non-collinear, since otherwise
any point lies on the plane of the three camera centres. The geometry of this
plane is quite interesting, and so a few more remarks will be made here.

Define $\pi^0$ to be the plane passing through the centres of the three cameras
$Q^i$ when a projective frame is chosen such that $Q^0 = P^0$. Theorem 2 states
that if $\mathbf{P}$ is a point on the intersection of the three quadrics
$S_P^{ij}$ and not on the plane $\pi^0$, then there exists a point $\mathbf{Q}$
such that $P^i\mathbf{P} = Q^i\mathbf{Q}$ for all $i$.

Now, there is nothing that distinguishes the first camera $P^0$ in this
situation. One is free to choose the projective frame for the three cameras
$Q^i$ independently of the $P^i$. Note in particular that $S_P^{ij}$ is unchanged
by applying a projective transform to the camera matrices $Q^i$ and $Q^j$, since
it depends only on their fundamental matrix $F_{Q^iQ^j}$. Thus, one could just as
well choose a frame for the cameras $Q^i$ such that $Q^1 = P^1$. The resulting
plane of the three camera centres $Q^i$ would be a different plane, denoted
$\pi^1$. Similarly one can obtain a further plane $\pi^2$ by choosing a frame
such that $Q^2 = P^2$. In general the planes $\pi^i$ will be different. If the
point $\mathbf{P}$ lies off one of the planes $\pi^i$, then one may conclude from
Theorem 2 that a point $\mathbf{Q}$ exists such that
$P^i\mathbf{P} = Q^i\mathbf{Q}$ for all $i$. The preceding discussion may be
summarized in the following corollary to Theorem 2.

Corollary 1. Let $(P^0, P^1, P^2)$ and $(Q^0, Q^1, Q^2)$ be two triples of
camera matrices, and assume that the three camera centres $C_Q^i$ are
non-collinear. Let $S_P^{ij}$ and $S_Q^{ij}$ be defined as in Theorem 2. For each
$i = 0, 1, 2$, let $H^i$ be a 3D projective transformation such that
$P^i = Q^i (H^i)^{-1}$. Let $\pi^i$ be the plane passing through the three
transformed camera centres $H^i C_Q^0$, $H^i C_Q^1$ and $H^i C_Q^2$.

1. If there exist points $\mathbf{P}$ and $\mathbf{Q}$ such that
$P^i\mathbf{P} = Q^i\mathbf{Q}$ for all $i = 0, 1, 2$, then $\mathbf{P}$ must lie
on the intersection $S_P^{01} \cap S_P^{02} \cap S_P^{12}$ and $\mathbf{Q}$ must
lie on $S_Q^{01} \cap S_Q^{02} \cap S_Q^{12}$.

2. Conversely, if $\mathbf{P}$ is a point lying on the intersection of quadrics
$S_P^{01} \cap S_P^{02} \cap S_P^{12}$, but not on the intersection of the three
planes $\pi^0 \cap \pi^1 \cap \pi^2$, then there exists a point $\mathbf{Q}$
lying on $S_Q^{01} \cap S_Q^{02} \cap S_Q^{12}$ such that
$P^i\mathbf{P} = Q^i\mathbf{Q}$ for all $i = 0, 1, 2$.

For cameras not in any special configuration, the three planes $\pi^i$ meet in a
single point, and apart from this point the critical set consists of the
intersection of the three quadrics $S_P^{ij}$.

The planes $\pi^i$ have other interesting geometric properties, which allow them
to be defined somewhat differently. This brief discussion requires an
understanding of the geometry of ruled quadric surfaces, for which the reader is
referred to [?]. Refer back to Fig. [?]. According to part 4 of Theorem 1, the
line between the centres of cameras $P^0 = Q^0$ and $Q^1$ lies on the surface
$S_P^{01}$. Thus, the plane $\pi^0$ meets $S_P^{01}$ in one of its generators,
and hence is a tangent plane to $S_P^{01}$. Similarly, $\pi^0$ meets $S_P^{02}$
in one of its generators, namely the line joining the centres of $P^0$ and
$Q^2$. Thus, $\pi^0$ is a common tangent plane to $S_P^{01}$ and $S_P^{02}$,
passing through the centre of camera $P^0$, which lies on the two quadrics.
(However, $\pi^0$ is not necessarily tangent to the surfaces at the centre of
$P^0$.)

In a similar way it may be argued that $\pi^1$ is a tangent plane to the pair of
quadrics $S_P^{01}$ and $S_P^{12}$, and $\pi^2$ is tangent to $S_P^{02}$ and
$S_P^{12}$.

5 Ambiguous Views of Seven Points 

In [?] a general method was given, based on a duality concept introduced by
Carlsson ([?]), for dualizing statements about projective reconstructions. The
basic idea is that the Cremona transform ([?])

$$\Gamma : (x, y, z, t) \mapsto (yzt, xzt, xyt, xyz)$$

induces a duality that swaps the role of points and cameras, with the exception
of 4 reference points, the vertices of the reference tetrahedron, the points
$E_1 = (1, 0, 0, 0)^\top, \ldots, E_4 = (0, 0, 0, 1)^\top$. Relevant to the
present subject is the observation ([?]) that the Carlsson map $\Gamma$ takes a
ruled quadric containing the points $E_i$ to another ruled quadric.
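
As a quick sanity check on the map $\Gamma$, the following Python sketch applies
the transform to integer homogeneous coordinates. It is not part of the original
method; the helper names `cremona` and `normalize` are ours and purely
illustrative. Under these assumptions it illustrates two facts that follow
directly from the formula: applying $\Gamma$ twice returns the original point up
to scale, and each reference vertex $E_i$ is sent to the zero vector, which is
why those four points must be treated as exceptions.

```python
from functools import reduce
from math import gcd


def cremona(point):
    """Apply (x, y, z, t) -> (yzt, xzt, xyt, xyz) to homogeneous coordinates."""
    x, y, z, t = point
    return (y * z * t, x * z * t, x * y * t, x * y * z)


def normalize(point):
    """Hypothetical helper: divide integer coordinates by their gcd and make
    the first nonzero coordinate positive, so equal projective points compare
    equal."""
    g = reduce(gcd, point)
    if g == 0:
        return tuple(point)  # the zero vector is not a projective point
    first = next(c for c in point if c != 0)
    sign = 1 if first > 0 else -1
    return tuple((sign * c) // g for c in point)


P = (1, 2, 3, 4)
once = cremona(P)        # (24, 12, 8, 6)
twice = cremona(once)    # (xyzt)^2 * P, i.e. P again up to scale
E1 = (1, 0, 0, 0)
image_of_E1 = cremona(E1)  # all products contain a zero factor
```

Since $\Gamma(\Gamma(x, y, z, t)) = (xyzt)^2 (x, y, z, t)$, the map is an
involution wherever all four coordinates are nonzero.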

In dualizing the statement of Theorem 2:

- the three non-collinear camera centres become seven points not lying on a
twisted cubic;

- the intersection of two ruled quadrics remains an intersection of two ruled
quadrics;

- the seven points must contain at least a set of four non-coplanar points to
act as the reference tetrahedron.

Theorem 3. Given a set of seven non-coplanar points $\mathbf{P}_j$ not lying on
a twisted cubic, there exists a curve $\gamma$ formed by the intersection of two
quadrics such that the projections of the $\mathbf{P}_j$ from any number of
cameras $P^i$ with centres $C_P^i$ lying on the curve $\gamma$ are insufficient
to determine the projective structure of the points $\mathbf{P}_j$ uniquely. In
particular, there exists an inequivalent set of points $\mathbf{Q}_j$ and
cameras $Q^i$ such that $P^i\mathbf{P}_j = Q^i\mathbf{Q}_j$ for all $i$ and $j$.

No proof of this is given here, since it follows almost immediately from
Theorem 2 by an application of Carlsson duality. For a description of the
general principle of duality as it relates to questions of this type, see [?].

6 Summary of Critical Configurations 

The various critical configurations discussed here are summarized in the following 
table. 



Table 1. Summary of different critical configurations. 



Problem                        Dual                         Critical set
-----------------------------  ---------------------------  ----------------------
2 views, 7 points              3 views, 6 points            ruled quadric
(various authors, e.g. [?])    (Quan [?])

2 views, n ≥ 7 points          n ≥ 3 views, 6 points        ruled quadric
(classical result, see [?])    (Maybank-Shashua [?])

3 views, n ≥ 6 points          n ≥ 2 views, 7 points        4-th degree curve, etc
(this paper)                   (this paper)


The minimal cases (first line of the table) are of course simply special cases of
those considered in the next line of the table. They have, however, been
considered separately in the literature and have been shown to have either one
or three real solutions. Note that these configurations involve 9 points (either
scene points or camera centres). However, 9 points always lie on a quadric
surface. There will be one or three solutions depending on whether the quadric
is ruled or not.

7 Conclusions

Although critical configurations for three views do exist, they are less common
than for two views, and are of lower dimension. Thus for practical algorithms of
reconstruction from three views it is safer to ignore the probability of
encountering a critical set than in the two view case. The exception is for a
set of points in the plane, for which it will always be impossible to determine
the camera placement.

Though no formal claim is made that the list of critical configurations given 
here is complete, it shows that such configurations are more common than might 
have been thought. My expectation is that a closer analysis will show that in 
fact this list is substantially complete.

Abstract. It is known that recovering projection matrices from planar
configurations is ambiguous, thus posing the problem of model selection:
is the scene planar (2D) or non-planar (3D)? For a 2D scene one would
recover a homography matrix, whereas for a 3D scene one would recover
the fundamental matrix or trifocal tensor. The task of model selection
is especially problematic when the scene is neither 2D nor 3D, for
example a “thin” volume in space.

In this paper we show that for certain tasks, such as reprojection, there
is no need to select a model. The ambiguity that arises from a 2D scene
is orthogonal to the reprojection process, thus if one desires to use
multilinear matching constraints for transferring points along a sequence of
views it is possible to do so under any situation of 2D, 3D or “thin”
volumes.



1 Introduction 

There are certain mathematical objects connected with multiple-view analysis,
which include: (i) the homography matrix (2D collineation), and (ii) objects
associated with multilinear constraints: the fundamental matrix and the trifocal
and quadrifocal tensors. Given two views of a planar configuration of features
(points or lines) it is possible to recover the mapping between the views as a
2D collineation (homography matrix), a transformation that is also valid when
the scene is 3D but the relative camera geometry consists of a pure rotation. On
the other hand, when the camera motion is general and the scene consists of a
three-dimensional configuration of features, then the valid transformations
across a number of views consist of multilinear relations that perform a variety
of point-to-line mappings. The coefficients of the multilinear constraints
encode the relative camera geometry, i.e., the projection matrices, and form a
matrix in two views, a 3 x 3 x 3 tensor in three views and a 3 x 3 x 3 x 3
tensor in four views.

The objects of multi-view analysis are often used for purposes of
reconstruction, i.e., 3D modeling from a collection of views, and for purposes
of feature-transfer (reprojection), i.e., predicting the image location of a
point (line) in some view given its locations in two other views. The
reprojection paradigm is useful for feature tracking along image sequences,
mosaicing, and image based rendering.



D. Vernon (Ed.): ECCV 2000, LNCS 1842, pp. 936-[^3 2000. 
(c) Springer- Verlag Berlin Heidelberg 2000 



On the Reprojection of 3D and 2D Scenes 937 



Regardless of the application, it seems necessary to know in advance whether
the scene viewed by the collection of images is 2D or 3D, because when the
scene is 2D the multilinear constraints are subject to an ambiguity: a rank-6
estimation matrix (instead of 8) for the fundamental matrix and rank-21 (instead
of 26) for the trifocal tensor. Hence arises the issue of model selection. There
has been a large body of research in the general area of model selection for
purposes of segmentation (due to shape, motion), and field of view (orthographic
versus perspective) [?]. Whatever scheme of model selection is chosen, it
is problematic in the sense that often a decision is to be made in uncertain
conditions; in our case, for example, when the scene is neither purely planar
nor spans a sufficiently large 3D volume.

In this paper we show that in the case of multilinear constraints, it is not
necessary to decide on a model for purposes of reprojection, i.e., whether a
homography matrix is better suited than a fundamental matrix, for example. Our
results
show that the null space, or the ambiguity space in general, of the estimation of 
multilinear constraints (fundamental matrix and trifocal tensor) is orthogonal to 
the task of reprojection. In other words, in a situation of three views of a planar 
scene the 6-dimensional null space of the trifocal tensor estimation is completely 
admissible for reprojection of features arising from the planar surface. Moreover, 
generally the space of uncertainty in recovering certain parameters of the tensor 
due to insufficient “3D volume” of the sampled surface is again orthogonal to 
reprojection of features arising from the sampled volume. 



2 Notations and Necessary Background

We will be working with the projective 3D space and the projective plane. In 
this section we will describe the basic elements we will be working with (i) 
homography matrix, (ii) camera projection matrices, (iii) fundamental matrix, 
(iv) tensor notations, and (v) trifocal tensor. 

A point in the projective plane is defined by three numbers, not all zero,
that form a coordinate vector defined up to a scale factor. In the projective
plane any four points in general position can be uniquely mapped to any other
four points in the projective plane. Such a mapping is called a collineation and
is defined by a 3 x 3 invertible matrix, defined up to scale. These matrices are
sometimes referred to as homographies. A collineation is defined by 4 pairs of
matching points, where each pair provides two linear constraints on the entries
of the homography matrix. If $A$ is a homography matrix defined by 4 matching
pairs of points, then $A^{-\top}$ (inverse transpose) is the dual homography that
maps lines onto lines.

The projective plane is useful to model the image plane. Consider a collection
of planar points $P_1, \ldots, P_n$ in space living on a plane $\pi$ viewed from
two views. The projections of $P_i$ are $p_i, p'_i$ in views 1, 2 respectively.
Because the collineations form a group, there exists a unique homography matrix
$A_\pi$ that satisfies the relation $A_\pi p_i \cong p'_i$, $i = 1, \ldots, n$,
and where $A_\pi$ is uniquely determined by 4 matching pairs from the set of $n$
matching pairs.

A point in 3D projective space is defined by four numbers, not all zero, that
form a coordinate vector defined up to a scale factor. A camera projection is a
3 x 4 matrix which maps points in 3D projective space to points in the
projective plane. A useful parameterization (which is the one we adopt in this
paper) is to have the 3D coordinate frame and the 2D coordinate frame of view 1
aligned. Thus, in the case we have three views, the three camera projection
matrices between the 3D projective space and the three image planes are denoted
by $[I; 0]$, $[A; v']$, $[B; v'']$, associated with views 1, 2, 3 respectively.
These camera matrices are not uniquely defined, as there is a 3-parameter degree
of freedom (“gauge” of the system): $[I; 0]$, $[A + v'w^\top; v']$,
$[B + v''w^\top; v'']$ agree with the same image data for all choices of $w$. The
multi-view tensors which we will define next are gauge-invariant, i.e., they are
invariant to the choice of $w$.

The 3 x 3 principal minor of the camera matrix, under this kind of
parameterization, is a homography matrix. The choice of gauge parameters
determines the position of the plane associated with the homography, the
reference plane. In particular, the space of all homography matrices between
views 1, 2 (up to scale) is $A + v'w^\top$.

The simplest multi-view tensor is the fundamental matrix $F = [v']_\times A$
whose entries are the coefficients of the bilinear matching constraint
${p'}^\top F p = 0$, where $p, p'$ are matching points in views 1, 2
respectively. Note that $F$ is gauge invariant as
$[v']_\times (A + v'w^\top) = [v']_\times A$.
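
The gauge invariance of $F$ can be checked numerically. The sketch below is an
illustration only: the matrix `A`, the epipole `v_prime`, the gauge vector `w`
and the helper names are arbitrary values and names chosen by us, and integer
arithmetic is used so that the equalities are exact. It builds
$F = [v']_\times A$, applies a gauge shift $A + v'w^\top$, and also evaluates
the bilinear constraint for a point projected by $[I; 0]$ and $[A; v']$.

```python
def skew(v):
    """Cross-product matrix [v]_x, so that skew(v) applied to u equals v x u."""
    a, b, c = v
    return [[0, -c, b], [c, 0, -a], [-b, a, 0]]


def matmul(M, N):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(M[r][k] * N[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


def matvec(M, v):
    return [sum(M[r][c] * v[c] for c in range(3)) for r in range(3)]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def gauge_shift(M, v, w):
    """Return M + v w^T (a different choice of reference plane)."""
    return [[M[r][c] + v[r] * w[c] for c in range(3)] for r in range(3)]


# Arbitrary illustrative values (assumptions, not data from the paper).
A = [[2, 0, 1], [1, 3, 0], [0, 1, 1]]
v_prime = [1, -2, 3]
w = [4, 5, -6]

F = matmul(skew(v_prime), A)
F_shifted = matmul(skew(v_prime), gauge_shift(A, v_prime, w))

# A 3D point (x, rho) projects to p = x in view 1 and p' = A x + rho v' in view 2.
x, rho = [1, 2, 1], 7
p = x
p_prime = [ax + rho * vp for ax, vp in zip(matvec(A, x), v_prime)]
epipolar_residual = dot(p_prime, matvec(F, p))
```

Here `F == F_shifted` holds because $[v']_\times v' = 0$, and
`epipolar_residual` vanishes for any choice of `x` and `rho`.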

It will be most convenient to use tensor notations from now on because the
multi-view tensors couple together pieces from different projection matrices
into a “joint” object. When working with tensor objects the distinction of when
coordinate vectors stand for points or lines matters. A point is an object whose
coordinates are specified with superscripts, i.e., $p^i = (p^1, p^2, p^3)$. These
are called contravariant vectors. A line in $\mathcal{P}^2$ is called a covariant
vector and is represented by subscripts, i.e., $s_j = (s_1, s_2, s_3)$. Indices
repeated in covariant and contravariant forms are summed over, i.e.,
$p^i s_i = p^1 s_1 + p^2 s_2 + p^3 s_3$. This is known as a contraction. For
example, if $p$ is a point incident to a line $s$ in $\mathcal{P}^2$ then
$p^i s_i = 0$.

Vectors are also called 1-valence tensors. 2-valence tensors (matrices) have
two indices and the transformation they represent depends on the
covariant-contravariant positioning of the indices. For example, $a_i^j$ is a
mapping from points to points (a collineation, for example), and hyperplanes
(lines in $\mathcal{P}^2$) to hyperplanes, because $a_i^j p^i = q^j$ and
$a_i^j s_j = r_i$ (in matrix form: $Ap = q$ and $A^\top s = r$); $a_{ij}$ maps
points to hyperplanes; and $a^{ij}$ maps hyperplanes to points. When viewed as a
matrix the row and column positions are determined accordingly: in $a_i^j$ and
$a_{ji}$ the index $i$ runs over the columns and $j$ runs over the rows, thus
$b_j^k a_i^j = c_i^k$ is $BA = C$ in matrix form. An outer-product of two
1-valence tensors (vectors), $a_i b^j$, is a 2-valence tensor $c_i^j$ whose
$i, j$ entries are $a_i b^j$; note that in matrix form $C = ba^\top$. A 3-valence
tensor has three indices, say $H_i^{jk}$. The positioning of the indices reveals
the geometric nature of the mapping: for example, $p^i s_j H_i^{jk}$ must be a
point because the $i, j$ indices drop out in the contraction process and we are
left with a contravariant vector (the index $k$ is a superscript). Thus,
$H_i^{jk}$ maps a point in the first coordinate frame and a line in the second
coordinate frame into a point in the third coordinate frame. The “trifocal”
tensor in multiple-view geometry is an example of such a tensor. A single
contraction, say $p^i H_i^{jk}$, of a 3-valence tensor leaves us with a matrix.
Note that when $p$ is $(1, 0, 0)$ or $(0, 1, 0)$, or $(0, 0, 1)$ the result is a
“slice” of the tensor.

The 3 x 3 x 3 trifocal tensor is defined below:

$$T_i^{jk} = {v'}^j b_i^k - {v''}^k a_i^j \qquad (1)$$

The elements of the tensor are coefficients of trilinear constraints on triplets
of matching points across the three views. Let $p, p', p''$ be a matching triplet
of points, i.e., they are projections of some point in 3D. Let $s$ be some line
coincident with $p'$, i.e., $s_j {p'}^j = 0$, and let $r$ be some line through
$p''$. Then,

$$p^i s_j r_k T_i^{jk} = 0. \qquad (2)$$

Because $p'$ is spanned by two lines (say, the horizontal and vertical scan
lines) and $p''$ as well, a triplet $p, p', p''$ generates 4 “trilinearities”,
each of which is a linear constraint on the elements of the tensor. Thus 7
matching points (or more) are sufficient to solve for the tensor. Note that the
trifocal tensor is also gauge invariant as:

$$T_i^{jk} = {v'}^j b_i^k - {v''}^k a_i^j \qquad (3)$$

$$= {v'}^j (b_i^k + w_i {v''}^k) - {v''}^k (a_i^j + w_i {v'}^j) \qquad (4)$$

$$= {v'}^j b_i^k - {v''}^k a_i^j + w_i {v'}^j {v''}^k - w_i {v'}^j {v''}^k \qquad (5)$$

$$= T_i^{jk} \qquad (6)$$
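
The chain (3)-(6) and the trilinearity (2) can be verified on a small integer
example. The following is only a sketch under our own assumptions: the matrices
`A`, `B`, the epipoles `v1`, `v2` (standing for $v'$, $v''$), the gauge vector
`w`, and all helper names are hypothetical choices made for illustration. The
index convention follows the text: $a_i^j$ is stored as `A[j][i]`.

```python
def matvec(M, v):
    return [sum(M[r][c] * v[c] for c in range(3)) for r in range(3)]


def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]


def trifocal(A, B, v1, v2):
    """T[i][j][k] = v1^j b_i^k - v2^k a_i^j, with a_i^j = A[j][i]."""
    return [[[v1[j] * B[k][i] - v2[k] * A[j][i] for k in range(3)]
             for j in range(3)] for i in range(3)]


def gauge_shift(M, v, w):
    """Return M + v w^T."""
    return [[M[r][c] + v[r] * w[c] for c in range(3)] for r in range(3)]


def trilinear(T, p, s, r):
    """Full contraction p^i s_j r_k T_i^{jk} of Eqn. (2)."""
    return sum(p[i] * s[j] * r[k] * T[i][j][k]
               for i in range(3) for j in range(3) for k in range(3))


# Arbitrary illustrative cameras [I; 0], [A; v1], [B; v2].
A = [[2, 0, 1], [1, 3, 0], [0, 1, 1]]
B = [[1, 1, 0], [0, 2, 1], [3, 0, 1]]
v1, v2, w = [1, -2, 3], [2, 1, -1], [4, 5, -6]

T = trifocal(A, B, v1, v2)
T_shifted = trifocal(gauge_shift(A, v1, w), gauge_shift(B, v2, w), v1, v2)

# Project a 3D point (x, rho) into the three views.
x, rho = [1, 2, 3], 2
p = x
p1 = [a + rho * e for a, e in zip(matvec(A, x), v1)]
p2 = [b + rho * e for b, e in zip(matvec(B, x), v2)]
s = cross(p1, [1, 0, 0])  # a line through p'
r = cross(p2, [0, 1, 0])  # a line through p''
residual = trilinear(T, p, s, r)
```

With exact integers, `T == T_shifted` reproduces the cancellation in (5), and
`residual` is zero for any line `s` through $p'$ and any line `r` through $p''$.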

Once the trifocal tensor is recovered from image measurements (matching
triplets, or matching lines, or matching points and lines) the task of
“reconstruction” is to extract the camera projection matrices (up to a choice of
gauge parameters) from the tensor. We will not discuss this here. The task of
“reprojection” is to predict (or “back-project”) the location of $p''$ using the
matching pair $p, p'$ and the tensor. This is done simply as:

$$p^i s_j T_i^{jk} \cong {p''}^k$$

and since there are two choices for $s$ we have a redundant system for
extracting $p''$.

These were the necessary details we need for the rest of the paper. More
details on the trifocal tensor can be found in the review [?] and in (not
exhaustive) [?].



3 Reprojection of a Planar Surface from Multilinear 
Constraints 

In case the scene is indeed 3D there is a one-to-one mapping between tensors
that satisfy Eqn. 1 and tensors that satisfy Eqn. 2. However, when the scene is
planar then a tensor that satisfies Eqn. 2 does not necessarily satisfy Eqn. 1,
as we will see now.

Consider a collection of matching point triplets $p, p', p''$ of a planar scene
$\pi$ in views 1, 2, 3, respectively. Because the scene is planar there exists a
unique homography matrix $A$ from view 1 to 2, i.e., $Ap \cong p'$, and a unique
homography matrix $B$ from view 1 to 3, i.e., $Bp \cong p''$. Let $\delta, \mu$
be arbitrary vectors, then the tensor

$$T_i^{jk} = \delta^j b_i^k - \mu^k a_i^j \qquad (7)$$

satisfies the trilinearity, Eqn. 2. To see why this is so, note that
$s^\top Ap = 0$ and $r^\top Bp = 0$ for triplets $p, p', p''$ arising from the
plane $\pi$. We have therefore:

$$p^i s_j r_k T_i^{jk} = p^i s_j r_k (\delta^j b_i^k - \mu^k a_i^j)$$

$$= (s_j \delta^j)(r_k b_i^k p^i) - (r_k \mu^k)(s_j a_i^j p^i)$$

$$= (s^\top \delta)(r^\top Bp) - (r^\top \mu)(s^\top Ap) = 0,$$

and this holds for all choices of the vectors $\delta, \mu$. As argued in [?]
this entails that the rank of the estimation matrix for the trifocal tensor from
measurements arising from a planar surface is at most 21 (instead of 26). In
other words, there are 6 degrees of freedom due to the indeterminacy of the
epipoles $(\delta, \mu)$. What is left to show is that all the solutions in the
null space are in the form of Eqn. 7. To see that, note that $A, B$ can be
homographies due to any other plane $\hat\pi$ and still satisfy the trilinearity
(Eqn. 2) if and only if $\delta = v'$ and $\mu = v''$ are the true epipoles: Let
$\hat A = \lambda A + v'n^\top$ and $\hat B = \lambda B + v''n^\top$ be the
homographies associated with the plane $\hat\pi$, then

$${v'}^j b_i^k - {v''}^k a_i^j \cong {v'}^j(\lambda b_i^k + n_i {v''}^k) - {v''}^k(\lambda a_i^j + n_i {v'}^j)$$

for all choices of $\lambda, n$ and thus in particular

$${v'}^j \hat b_i^k - {v''}^k \hat a_i^j \cong {v'}^j b_i^k - {v''}^k a_i^j.$$

To conclude, because the epipoles cannot be determined from the trilinearities
(Eqn. 2), all the tensors in the null space are of the form of Eqn. 7 where
$A, B$ are the homographies due to the plane $\pi$.

We have, therefore, an ambiguity whose source arises from the uncertainty
in recovering the epipoles from the image measurements. Thus, recovering
projection matrices is not possible. Yet, consider the problem of reprojection:

$$p^i s_j T_i^{jk} = (s_j \delta^j)(b_i^k p^i) - \mu^k (s_j a_i^j p^i)$$

$$= (s^\top \delta)Bp - (s^\top Ap)\mu \cong p''.$$

In other words, for all choices of $\delta, \mu$, a matching point and line in
views 1, 2 uniquely determine the location of the matching point in view 3,
provided that the matching triplet $p, p', p''$ arises from the plane $\pi$. We
can conclude, therefore, that the null space for estimating the trifocal tensor
from image measurements arising from a planar surface is orthogonal to the
reprojection equation $p^i s_j T_i^{jk}$ where the matching points arise from
the same planar surface that was sampled in the process of recovering
$T_i^{jk}$.
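
The claim can be illustrated with a short Python sketch. Everything specific in
it is an assumption made for the example: the homographies `A`, `B`, the point
`p`, the two arbitrary choices of $(\delta, \mu)$ and the helper names are ours.
For each choice, the tensor of Eqn. 7 is built and contracted with $p$ and a
line $s$ through $p' = Ap$; the result should be a multiple of $p'' = Bp$
whenever $s^\top \delta \neq 0$.

```python
def matvec(M, v):
    return [sum(M[r][c] * v[c] for c in range(3)) for r in range(3)]


def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]


def null_space_tensor(A, B, delta, mu):
    """T[i][j][k] = delta^j b_i^k - mu^k a_i^j (Eqn. 7), with a_i^j = A[j][i]."""
    return [[[delta[j] * B[k][i] - mu[k] * A[j][i] for k in range(3)]
             for j in range(3)] for i in range(3)]


def reproject(T, p, s):
    """Contract p^i s_j T_i^{jk}, leaving a point indexed by k."""
    return [sum(p[i] * s[j] * T[i][j][k] for i in range(3) for j in range(3))
            for k in range(3)]


# Homographies induced by the plane (illustrative integer values).
A = [[1, 0, 2], [0, 1, 1], [0, 0, 1]]
B = [[2, 1, 0], [0, 1, 0], [1, 0, 1]]

p = [3, 4, 1]
p1 = matvec(A, p)          # matching point in view 2
p2 = matvec(B, p)          # true matching point in view 3
s = cross(p1, [1, 0, 0])   # a line through p1

# Two unrelated choices of the free vectors (delta, mu).
choices = [([1, 2, 3], [-4, 0, 7]), ([0, 1, 0], [9, 9, 9])]
results = [reproject(null_space_tensor(A, B, d, m), p, s) for d, m in choices]
```

Both entries of `results` are parallel to `p2`, i.e. their cross product with
`p2` is the zero vector, even though the two tensors differ.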

In practical terms, given a collection of matching triplets $p, p', p''$
sampling a certain volume in space, each triplet provides 4 linear equations for
the trifocal tensor. The eigenvector associated with the smallest eigenvalue of
the estimation matrix is the trifocal tensor. In case the matching triplets came
from a 3D scene, the solution is unique, whereas in case the matching triplets
came from a planar configuration the solution is not unique (the 6 eigenvectors
corresponding to the 6 smallest eigenvalues span the solution space). That does
not matter, however: as long as the matching points used for the estimation of
the tensor span the scene volume of interest (if the points came from a plane it
means that the scene is planar, for example), the reprojection is valid
nevertheless. The following theorem summarizes the findings so far:

Theorem 1. In case a collection of matching triplets $p, p', p''$ whose
corresponding 3D points sample some volume in space are given, then the
eigenvector associated with the smallest eigenvalue of the estimation matrix to
the trifocal tensor forms a trifocal tensor that is valid for reprojecting
points $p, p'$ onto $p''$, regardless of whether the volume is a 2D plane or a 3D
volume, provided that the corresponding 3D points come from the same volume in
space sampled during the estimation process.

This state of matters is not characteristic solely of the trifocal tensor; it is
a general geometric property. Consider performing reprojection using pairwise
fundamental matrices, for example. Let $F_{13}$ be the fundamental matrix
satisfying ${p''}^\top F_{13} p = 0$ for all matching pairs $p, p''$, and let
$F_{23}$ be the fundamental matrix satisfying ${p''}^\top F_{23} p' = 0$ for all
matching pairs $p', p''$. The reprojection equation is an intersection of
epipolar lines:

$$p'' \cong F_{13} p \times F_{23} p'.$$

Generally it is not a good idea to rely on epipolar intersection as it becomes
degenerate when the three camera centers are collinear, but nevertheless this
provides an alternative to the reprojection equation using the trifocal tensor.
When the triplet $p, p', p''$ arises from a planar configuration, then
$F_{13} = [\delta]_\times B$ and $F_{23} = [\mu]_\times B A^{-1}$ satisfy the
bilinear constraints ${p''}^\top F_{13} p = 0$ and ${p''}^\top F_{23} p' = 0$,
for all choices of the vectors $\delta, \mu$. Thus, the rank of the estimation
matrix for the fundamental matrix becomes 6 (instead of 8). Reprojection,
however, is unaffected by the choice of $\delta, \mu$ provided that the pairs
$p, p'$ to be reprojected arise from the same planar surface that was sampled in
the process of recovering $F_{13}$ and $F_{23}$:

$$F_{13} p \times F_{23} p' = ([\delta]_\times Bp) \times ([\mu]_\times BA^{-1}p')$$

$$= (\delta \times p'') \times (\mu \times p'') \cong p''$$
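
A minimal numerical sketch of this identity follows. The values of `A`, `B`,
`delta`, `mu` and `p`, and the helper names, are illustrative assumptions of
ours. Since all quantities are homogeneous, the adjugate of `A` is used in place
of $A^{-1}$ (they differ only by the scalar $\det A$), which keeps the
arithmetic in exact integers.

```python
def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]


def skew(v):
    """Cross-product matrix [v]_x."""
    a, b, c = v
    return [[0, -c, b], [c, 0, -a], [-b, a, 0]]


def matmul(M, N):
    return [[sum(M[r][k] * N[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


def matvec(M, v):
    return [sum(M[r][c] * v[c] for c in range(3)) for r in range(3)]


def adjugate(M):
    """det(M) * inverse(M); its columns are cross products of the rows of M."""
    r0, r1, r2 = M
    cols = [cross(r1, r2), cross(r2, r0), cross(r0, r1)]
    return [[cols[c][r] for c in range(3)] for r in range(3)]


# Plane-induced homographies and arbitrary free vectors (illustrative values).
A = [[2, 0, 1], [1, 3, 0], [0, 1, 1]]
B = [[2, 1, 0], [0, 1, 0], [1, 0, 1]]
delta, mu = [1, 2, 3], [-4, 0, 7]

F13 = matmul(skew(delta), B)
F23 = matmul(matmul(skew(mu), B), adjugate(A))  # [mu]_x B A^{-1} up to scale

p = [3, 4, 1]
p1 = matvec(A, p)   # matching point in view 2
p2 = matvec(B, p)   # true matching point in view 3

# Intersect the two epipolar lines in view 3.
q = cross(matvec(F13, p), matvec(F23, p1))
```

The intersection `q` is parallel to `p2` as long as $\delta$, $\mu$ and $p''$
are linearly independent, which is the case for the values chosen here.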

Note that unlike the trifocal tensor estimation, which requires triplets
$p, p', p''$ of matching points in the estimation process, here the requirement
is pairs of matching points $p, p''$ and $p', p''$ that do not necessarily arise
from the same point in 3D. This raises the possibility, for example, that
$F_{23}$ is estimated from a 3D scene, yet $F_{13}$ is estimated from a planar
scene. The process of reprojection would remain valid nevertheless provided that
the points $p, p'$ used for reprojection arise from a surface whose
dimensionality is less than or equal to the dimensionalities of the surfaces used
for estimation of $F_{13}$ and $F_{23}$.

4 Sensitivity Analysis of “thin” Volumes 

We have seen in the previous section that the ambiguity of the tensor estimation 
in the presence of a planar configuration of points does not affect the reprojection 
process of points coming from the planar surface. In this section we wish to 
investigate the reprojection process for “thin” volumes — the point configuration 
does not form a 2D plane but almost does so (shallow surface, aerial photograph, 
for example). Strictly speaking, a point configuration can be either 2D (plane) 
or 3D (non-coplanar), there is no in-between. But, in practice it is important 
to investigate the (numerical) sensitivity of the reprojection process in order to 
be convinced that the transition between planar and 3D is a continuous one. 
In other words, we would like to establish the fact that the estimation of the 
trifocal tensor, from a point configuration that spans any volume in 3D space, 
will produce a valid reprojection of that volume. 

We wish to show that all tensors that can be recovered from a “thin” volume
are equal to the first order. To do so, think of a “thin” volume as two planes
infinitesimally separated (to be defined later). We will show that any form of
indeterminacy of the epipoles (whether complete or partial) leads to at most a
second order error in the infinitesimal variables, and hence can be neglected.
In other words, we will employ infinitesimal calculus (see [2]) of the first
order in our investigation, such that if $\epsilon$ is an infinitesimal variable
in a calculation, then $\epsilon^2 = 0$ (and higher orders).

We will first consider the estimation of the trifocal tensor from a point
configuration arising from two distinct planes $\pi, \hat\pi$. Let $A, B$ be the
homography matrices due to $\pi$ from views 1 to 2 and from views 1 to 3,
respectively, and let $\hat A, \hat B$ be the homographies due to $\hat\pi$.
Then, there exist $\lambda, n$ that satisfy:

$$\hat A = \lambda A + v'n^\top$$

$$\hat B = \lambda B + v''n^\top$$

where $v', v''$ are the epipoles in views 2, 3 respectively (projections of the
first camera center onto views 2, 3). The vector $n$ is the projection on view 1
of the intersecting line between $\pi, \hat\pi$ and $(n^\top, \lambda)$ is the
plane passing through the first camera center and the line $n$ in view 1. Let
the space of solutions to the trifocal tensor arising from matching triplets
corresponding to $\pi$ be

$$\delta^j b_i^k - \mu^k a_i^j$$

where $\delta, \mu$ are free vectors, and let the space of solutions arising from
$\hat\pi$ be

$$\hat\delta^j \hat b_i^k - \hat\mu^k \hat a_i^j$$

where $\hat\delta, \hat\mu$ are free vectors. The space of solutions arising from
measurements corresponding to both $\pi, \hat\pi$ is the intersection of the null
spaces, i.e., we wish to find $\delta, \hat\delta, \mu, \hat\mu$ that satisfy

$$\delta^j b_i^k - \mu^k a_i^j = \hat\delta^j \hat b_i^k - \hat\mu^k \hat a_i^j.$$

After rearranging terms:

$$(\delta^j - \lambda\hat\delta^j) b_i^k - (\mu^k - \lambda\hat\mu^k) a_i^j = n_i(\hat\delta^j {v''}^k - \hat\mu^k {v'}^j).$$

Since the left-hand side is at least a rank-4 tensor ($A, B$ cannot be lower than
rank-2) and the right-hand side is a rank-2 tensor, equality can hold only if
$\delta = \lambda\hat\delta$ and $\mu = \lambda\hat\mu$. Thus,
$\hat\delta, \hat\mu$ must satisfy

$$\hat\delta^j {v''}^k - \hat\mu^k {v'}^j = 0,$$

which could happen if and only if $\hat\delta = \alpha v'$ and
$\hat\mu = \alpha v''$ for all $\alpha$. Taken together, the intersection of the
null spaces is a unique tensor:

$${v'}^j b_i^k - {v''}^k a_i^j.$$

The derivation above is simply another route for proving the existence and form
of the trifocal tensor from image measurements arising from point matches
corresponding to a 3D set of points. However, it is shown that two planes are
sufficient for a unique determination (two distinct planes and the camera center
of view 1 form a simplex). Analogously, the fundamental matrix between views
1, 2 is known to be uniquely determined from the relationship
$A^\top F + F^\top A = 0$ and $\hat A^\top F + F^\top \hat A = 0$, and the proof
follows the same lines as above.

We will use this line of derivation of the trifocal tensor to consider next the
situation where the two planes $\pi, \hat\pi$ are infinitesimally separated. This
is defined by letting $\hat A, \hat B$ be defined as:

$$\hat A = \lambda A + dA$$

$$\hat B = \lambda B + dB$$

where $dA, dB$ are matrices whose entries are infinitesimal to the first order,
i.e., higher orders of these variables can be neglected. Because $dA, dB$ may be
arbitrary (i.e., $v', v''$ are completely masked out in the presence of noise)
the null spaces may not have a common intersection. But, instead of an
intersection we are looking for $\delta, \mu, \hat\delta, \hat\mu$ such that the
null spaces have a common infinitesimal locus, i.e., a locus that is defined by
second (or higher) order terms of $dA, dB$. In other words, let
$\mathcal{T}(\delta, \mu)$ be the space of tensors (null space) of the form
$\delta^j b_i^k - \mu^k a_i^j$ and let $\hat{\mathcal{T}}(\hat\delta, \hat\mu)$
be the space of tensors $\hat\delta^j \hat b_i^k - \hat\mu^k \hat a_i^j$; then
we are looking for $\hat\delta, \hat\mu$ such that
$\mathcal{T}(\delta, \mu) - \hat{\mathcal{T}}(\hat\delta, \hat\mu) =_{inf} 0$,
where the symbol $=_{inf}$ denotes equality up to second order terms of
infinitesimal variables.

Let $\delta = \lambda\hat\delta$ and $\mu = \lambda\hat\mu$, then
$\hat\delta, \hat\mu$ must satisfy

$$\hat\delta^j \, db_i^k - \hat\mu^k \, da_i^j =_{inf} 0.$$

Let $\hat\delta$ take some linear combination of the rows of $dA$ and let
$\hat\mu$ be equal to some linear combination of the rows of $dB$ (see Appendix
for proof that such a choice corresponds to an $L_2$ norm minimization of the
expression above). Then, the expression above involves bilinear products of
infinitesimal variables; thus the equality of the first order is achieved, i.e.
$\mathcal{T}(\delta, \mu) - \hat{\mathcal{T}}(\hat\delta, \hat\mu) =_{inf} 0$
for the choices we made. The theorem below summarizes the findings above:

Theorem 2. In the case where the trifocal tensor is estimated from point
matches coming from an infinitesimally thin volume in space, then in the worst
case condition (measurement noise completely masks out the location of the
epipoles $v', v''$), the solutions in the null space are valid for reprojection
of points of the sampled volume, up to a measure zero of infinitesimal
variation.

5 Experiments 

We show results on three real image sequences. In all cases we use a progressive
scan Canon ELURA DV camcorder that produces RGB images of size 720 x 480
pixels. We use the KLT package to automatically detect and track a list
of interest points throughout the sequence (about 100 points on average). We
then estimate the trifocal tensor on successive triplets of frames, reprojecting
points from the first two frames in the triplet to the third. The trifocal tensor
is estimated using the method described in this paper with the usual LMeDS
(Least Median of Squares) procedure. Specifically, the algorithm proceeds as
follows. Sets of seven points are sampled randomly from the set of all matching
points. The estimation matrix is constructed and the tensor is taken to be the
eigenvector that corresponds to the smallest eigenvalue. Then we measure the
reprojection error of the recovered tensor for the rest of the points and take
the median of the reprojection error as the score of this tensor. The process is
repeated 50 times. The tensor with the lowest score is the winner. We then
recompute the tensor, using the same method, but now with all the points whose
reprojection error is lower than the score of this tensor.
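
The sampling loop described above can be sketched as follows. This is a minimal
illustration, not the authors' implementation: the function names (lmeds,
smallest_eigvec, line_rows, line_errors) are hypothetical, and a 2D line fit
stands in for the trifocal tensor so that the example stays self-contained. For
the tensor, build_rows would assemble the estimation matrix from seven point
matches and errors would return reprojection errors. The refit keeps points with
error less than or equal to the winning score, a small deviation from "lower
than" that keeps the sketch well defined on noise-free data.

```python
import numpy as np

def smallest_eigvec(rows):
    # Unit vector h minimizing |rows @ h|: the eigenvector of rows.T @ rows
    # with the smallest eigenvalue, obtained here through the SVD.
    return np.linalg.svd(rows)[2][-1]

def lmeds(data, build_rows, errors, sample_size, trials=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(data)
    best, best_score = None, np.inf
    for _ in range(trials):
        pick = rng.choice(n, size=sample_size, replace=False)
        model = smallest_eigvec(build_rows(data[pick]))
        rest = np.setdiff1d(np.arange(n), pick)         # points not sampled
        score = np.median(errors(model, data[rest]))    # median error = score
        if score < best_score:
            best, best_score = model, score
    keep = errors(best, data) <= best_score             # points for the refit
    return smallest_eigvec(build_rows(data[keep])), best_score

# Stand-in model (assumption): the line h0*x + h1*y + h2 = 0.
def line_rows(pts):
    return np.column_stack([pts, np.ones(len(pts))])

def line_errors(h, pts):
    return np.abs(line_rows(pts) @ h) / np.hypot(h[0], h[1])

# Synthetic data: 70 points near y = 2x + 1 plus 30 gross outliers.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
inliers = np.column_stack([x[:70], 2 * x[:70] + 1 + rng.normal(0, 0.01, 70)])
outliers = np.column_stack([x[70:], rng.uniform(0, 25, 30)])
pts = np.vstack([inliers, outliers])

h, score = lmeds(pts, line_rows, line_errors, sample_size=2)
slope, intercept = -h[0] / h[1], -h[2] / h[1]
```

Using the median rather than the mean as the score is what lets the procedure
tolerate a large fraction of mistracked points.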



Experiment 1. We move the camera from a "volumetric" scene to a very shallow
scene, gathering 36 images as we move. We compute the trifocal tensor of
successive triplets of images and reproject the points in the first two images
to the third. Figure 1 shows some of the 36 images, with the tracked and
reprojected points superimposed. The average reprojection error is about 0.5
pixels. More interestingly, we plotted the average reprojection error across the
36 images and did not find a clear correlation between the reprojection error
and the "thickness" of the 3D scene. Recall that the camera is moving from a
full 3D scene to a very shallow scene.

Fig. 1. (a), (b), (c): First, middle and last images in a sequence of 36 images.
White circles represent the tracked points. Black crosses represent the
reprojected points. Average reprojection error is 0.5 pixels. (d) shows
reprojection error, in pixels, across the 36 images of the sequence. There is no
clear correlation between the reprojection error and the volume of the 3D scene.
Note that the camera is moving from a full 3D scene to a very shallow 3D scene.

Experiment 2. This experiment is similar to the previous one, only this time
the camera moves from a planar scene to a full volume 3D scene, gathering 26
images as it moves. Again, we compute the trifocal tensor of successive triplets
of images and reproject the points in the first two images to the third. Figure
2 shows some of the 26 images, with the tracked and reprojected points
superimposed. The average reprojection error is about 0.5 pixels. Again, we
plotted the average reprojection error across the 26 images and did not find a
clear correlation between the reprojection error and the "thickness" of the 3D
scene. Recall that this time the camera is moving from a planar scene to a full
3D scene.





Fig. 2. (a), (b): First and last images in a sequence of 26 images. White circles
represent the tracked points. Black crosses represent the reprojected points.
Average reprojection error is 0.5 pixels. (c) shows reprojection error, in
pixels, across the 26 images of the sequence. There is no clear correlation
between the reprojection error and the volume of the 3D scene. Note that the
camera is moving from a planar scene to a full 3D scene.



Experiment 3. In this experiment we demonstrate the reprojection power of
our method given the same camera configuration but using different sections of
the 3D scene. We repeated the experiment twice on the same triplet of images,
once using all the points in the scene and once using only points on a plane.
The results are shown in Figure 3. White circles represent the tracked points.
Black crosses represent the reprojected points. In the first case the scene has a
large volume and our method has no problem reprojecting the points, with an
average error of 0.5 pixels. In the second case we manually deleted all the
points outside a specific plane and ran the algorithm again. The average
reprojection error was now 0.2 pixels.




Fig. 3. Original images (a), (b) are reprojected to the third image using all
the points in the scene (c) or only the points on the plane (d). Images (c) and
(d) show the third image with the different point configurations. The
reprojection error is 0.5 pixels for image (c) and 0.2 pixels for image (d).



6 Summary 

We have shown, in this paper, that the ambiguity in recovering multi-linear
constraints from planar scenes is orthogonal to tasks such as reprojection.
Thus, it is not necessary to choose a different model for different scenes
(homography for 2D scenes or trifocal tensor/fundamental matrix for 3D scenes),
as the ambiguity in the recovered parameters does not affect our ability to
perform reprojection. Moreover, in the case of a "thin" volume, which is neither
2D nor 3D, our method will generate a tensor that is provably correct for
reprojecting all the points within this volume. We thus have a unified method
for reprojecting planar, "thin" and full volume scenes. Finally, while the
results we have shown are relevant to the process of reprojection, we believe
that they can be used in some reconstruction situations as well, but leave this
for future research.

A Appendix 

Consider the expression

$\| x b^\top - a y^\top \|_{L_2} + \| x c^\top - d y^\top \|_{L_2} + \| x e^\top - f y^\top \|_{L_2}$

where $a, b, c, d, e, f$ are vectors, and $\|\cdot\|_{L_2}$ stands for the L2
norm of a matrix defined by the sum of squares of the matrix entries. The
vectors $x, y$ that bring the expression to minimum are described by
$x = \alpha_1 a + \alpha_2 d + \alpha_3 f$ and
$y = \beta_1 b + \beta_2 c + \beta_3 e$ for some coefficients
$\alpha_i, \beta_i$, $i = 1, 2, 3$. The derivation is as follows.

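Since $\|A\|_{L_2} = \mathrm{trace}(A^\top A) = \mathrm{trace}(A A^\top)$, the
expression above, written in terms of traces, is

$(b^\top b + c^\top c + e^\top e)(x^\top x) - 2 (b^\top y)(x^\top a) - 2 (c^\top y)(x^\top d)$
$\quad - 2 (e^\top y)(x^\top f) + (a^\top a + d^\top d + f^\top f)(y^\top y)$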

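The partial derivatives with respect to x and y are therefore

$\frac{1}{2}\frac{\partial}{\partial x} = (b^\top b + c^\top c + e^\top e)\,x - (b^\top y)\,a - (c^\top y)\,d - (e^\top y)\,f = 0$
$\frac{1}{2}\frac{\partial}{\partial y} = (a^\top a + d^\top d + f^\top f)\,y - (x^\top a)\,b - (x^\top d)\,c - (x^\top f)\,e = 0.$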
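
The expansion and the two stationarity conditions can be checked numerically.
The sketch below is only an illustration under assumptions: all vectors are
random and 3-dimensional, and the helper names (cost, cost_expanded,
half_grad_x, half_grad_y, numeric_grad) are hypothetical. It compares the sum of
squared matrix entries against the trace form, and the analytic derivatives
against central differences. It also forms, for a fixed y, the x that zeroes the
first condition, which is by construction a combination of a, d and f.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c, d, e, f, x, y = rng.standard_normal((8, 3))

def cost(x, y):
    # Sum of squared entries of x q^T - p y^T over the three vector pairs.
    pairs = ((a, b), (d, c), (f, e))
    return sum(np.sum((np.outer(x, q) - np.outer(p, y)) ** 2) for p, q in pairs)

def cost_expanded(x, y):
    # The same quantity written with inner products (the trace form).
    return ((b @ b + c @ c + e @ e) * (x @ x)
            - 2 * (b @ y) * (x @ a) - 2 * (c @ y) * (x @ d) - 2 * (e @ y) * (x @ f)
            + (a @ a + d @ d + f @ f) * (y @ y))

def half_grad_x(x, y):
    return (b @ b + c @ c + e @ e) * x - (b @ y) * a - (c @ y) * d - (e @ y) * f

def half_grad_y(x, y):
    return (a @ a + d @ d + f @ f) * y - (x @ a) * b - (x @ d) * c - (x @ f) * e

def numeric_grad(fun, v, h=1e-6):
    # Central differences along each coordinate axis.
    return np.array([(fun(v + h * u) - fun(v - h * u)) / (2 * h)
                     for u in np.eye(len(v))])

gx = numeric_grad(lambda v: cost(v, y), x)
gy = numeric_grad(lambda v: cost(x, v), y)

# For fixed y, the x that satisfies the first condition is a combination of a, d, f.
x_star = ((b @ y) * a + (c @ y) * d + (e @ y) * f) / (b @ b + c @ c + e @ e)
```

The weights multiplying a, d and f in x_star play the role of the coefficients
in the stated form of x; the symmetric argument applies to y with b, c and e.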